Vehicle control method and apparatus, model training method, and device and medium
By using image sensors and gesture recognition models on the vehicle, the driver's hand gestures are captured and recognized, enabling vehicle control. This solves the problem of insufficient control buttons on the vehicle handlebars and provides richer operation and interaction methods.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- NEXTVPU (SHANGHAI) CO LTD
- Filing Date
- 2025-12-04
- Publication Date
- 2026-06-25
AI Technical Summary
Vehicle handlebars have limited control buttons, necessitating richer and faster operation and interaction methods.
By capturing the driver's hand gestures using image sensors on the vehicle and comparing them with a pre-trained gesture recognition model, vehicle control can be achieved.
It provides a variety of convenient vehicle control methods, solves the problem of insufficient handle control buttons, is suitable for end-side deployment, and improves the driver's operating experience.
Abstract
Description
Vehicle control methods, model training methods, devices, equipment and media
[0001] This application claims priority to Chinese Patent Application No. 202411864781.9, filed with the Chinese Patent Office on December 17, 2024, the entire contents of which are incorporated herein by reference.
Technical Field
[0002] This application relates to the field of computer vision technology, for example to a vehicle control method, a model training method, an apparatus, a device, and a medium.
Background Technology
[0003] To make it easier for drivers to control two-wheeled vehicles and other similar vehicles, various control buttons are installed on the handlebars. However, the number of handlebar control buttons is limited, and there is a need for more diverse and convenient operation and interaction methods.
[0004] The methods described in this section are not necessarily methods that have been previously conceived or adopted. Unless otherwise specified, no method described in this section should be assumed to be prior art simply because it is included in this section. Similarly, unless otherwise specified, the issues mentioned in this section should not be assumed to have been recognized in any prior art.
Summary of the Invention
[0005] This application provides a vehicle control method, including:
[0006] Capturing an image to be recognized of the vehicle's surroundings via an image sensor on the vehicle;
[0007] Inputting the image to be recognized into a pre-trained gesture recognition model to obtain the gesture action performed by the driver; and
[0008] Comparing the gesture action performed by the driver with a predefined reference gesture and, when the two match, controlling the vehicle according to the operation instruction corresponding to the reference gesture.
[0009] In some embodiments, the vehicle control method further includes:
[0010] The image to be recognized is preprocessed to convert it into a pre-defined format.
[0011] In some embodiments, the vehicle is a handlebar-controlled vehicle.
[0012] In some embodiments, the image sensor is mounted on the vehicle's dashboard or in an area of the vehicle at a preset distance from the dashboard.
[0013] In some embodiments, the operation instructions are non-safety operation instructions.
[0014] In some embodiments, the vehicle is a two-wheeled vehicle.
[0015] In some embodiments, the gesture recognition model is a neural network model trained using the following method:
[0016] Obtain the gesture action detection dataset, which includes multiple image samples of the vehicle while it is stationary or moving.
[0017] For each image sample in a plurality of image samples:
[0018] Extract edge and background information from the image sample;
[0019] Perform multi-scale forward enhancement on the image sample; and
[0020] Identify one or more hand gestures in the image sample; and
[0021] A neural network model is trained based on edge information, background information, and one or more recognized gestures to establish a gesture recognition model.
[0022] In some embodiments, for each image sample among a plurality of image samples, extracting the edge information and background information of that image sample includes:
[0023] Perform global average pooling and global max pooling operations on each of the multiple image samples to obtain the edge and background information of that image sample.
[0024] In some embodiments, the vehicle control method further includes, for the identified one or more gesture actions, determining a loss function between the predicted bounding boxes and labeled bounding boxes of one or more gesture actions;
[0025] The neural network model is trained based on edge information, background information, and one or more recognized gestures to establish a gesture recognition model, including:
[0026] A neural network model is trained based on edge information, background information, and a loss function to establish a gesture recognition model.
[0027] In some embodiments, the loss function is the Inner Complete Intersection over Union (Inner CIoU) loss function.
[0028] In some embodiments, the vehicle control method further includes determining the rectified linear unit (ReLU) function as the activation function of the neural network model;
[0029] The neural network model is trained based on edge information, background information, and one or more recognized gestures to establish a gesture recognition model, including:
[0030] A neural network model is trained based on edge information, background information, and activation functions to establish a gesture recognition model.
[0031] In some embodiments, training a neural network model based on edge information, background information, and one or more recognized gestures to establish a gesture recognition model includes:
[0032] Perform redundancy removal on the neural network model.
[0033] In some embodiments, the vehicle control method further includes:
[0034] After establishing the gesture recognition model, graph optimization operations are performed on the convolutional layers, batch normalization layers, and activation functions of the gesture recognition model.
[0035] In some embodiments, performing graph optimization operations on the convolutional layers, batch normalization layers, and activation functions of the gesture recognition model includes:
[0036] The gesture recognition model is reparameterized to couple multiple operators of the corresponding convolutional layer, batch normalization layer, and activation function of the gesture recognition model.
[0037] This application provides a method for training a gesture recognition model, including:
[0038] Obtain the gesture action detection dataset, which includes multiple image samples of the vehicle while it is stationary or moving.
[0039] For each image sample in a plurality of image samples:
[0040] Extract edge and background information from the image sample;
[0041] Perform multi-scale forward enhancement on the image sample; and
[0042] Identify one or more hand gestures in the image sample; and
[0043] A neural network model is trained based on edge information, background information, and one or more recognized gestures to establish a gesture recognition model.
[0044] This application provides a vehicle control device, including:
[0045] An image capture unit is configured to capture images of the area around the vehicle to be identified via an image sensor on the vehicle.
[0046] A gesture recognition unit is configured to recognize an image to be recognized using a pre-trained gesture recognition model to obtain the gesture actions performed by the driver; and
[0047] The control unit is configured to compare the hand gestures performed by the driver with predefined reference gestures. When the hand gestures match, the control unit controls the vehicle according to the corresponding operation command of the reference gesture.
[0048] This application provides an electronic device, including:
[0049] At least one processor; and
[0050] A memory that is communicatively connected to at least one processor; wherein,
[0051] The memory stores a computer program that can be executed by at least one processor, such that the at least one processor is able to execute a vehicle control method or a model training method for a gesture recognition model.
[0052] This application provides a computer-readable storage medium storing computer instructions that are used to cause a processor to execute a vehicle control method or a gesture recognition model training method.
[0053] This application provides a computer program product, including a computer program, which, when executed by a processor, implements a vehicle control method or a gesture recognition model training method.
Attached Figure Description
[0054] Figure 1 is a flowchart of the vehicle control method of this embodiment;
[0055] Figure 2 is a flowchart of the model training method for the gesture recognition model in this embodiment;
[0056] Figure 3 is a flowchart of performing global average pooling and global max pooling operations on an image according to an exemplary embodiment;
[0057] Figure 4 is a flowchart of performing multi-scale forward enhancement processing on an image according to an exemplary embodiment;
[0058] Figure 5 is a structural block diagram of the vehicle control device in this embodiment;
[0059] Figure 6 is a schematic diagram of the structure of an electronic device that can be used to implement an embodiment of this application.
Detailed Implementation
[0060] In this application, the terms “comprising” and “having” and any variations thereof are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that includes a series of steps or units is not limited to the steps or units expressly listed in the embodiments of this application, and may also include steps or units that are not expressly listed or that are inherent to such a process, method, system, product, or device.
[0061] Example 1
[0062] Figure 1 shows a flowchart of the vehicle control method 100 of this embodiment. As shown in Figure 1, the vehicle control method 100 may include: step S110, capturing an image to be recognized around the vehicle via an image sensor on the vehicle; step S120, inputting the image to be recognized into a pre-trained gesture recognition model to obtain the gesture action performed by the driver; and step S130, comparing the gesture action performed by the driver with a predefined reference gesture, and when the gesture comparison result is consistent, controlling the vehicle according to the operation command corresponding to the reference gesture.
[0063] In step S110, the image sensor can be any suitable sensor capable of capturing images, such as a camera, video camera, or webcam. The image sensor can be located anywhere at the front of the vehicle, as long as it can capture the driver's image in real time. For example, it can be installed on or near the vehicle's dashboard, or along the vehicle's centerline. As one example, the image sensor uses a fisheye lens facing the driver's upper body to track and detect the driver's limbs and gestures.
[0064] In some implementations, the vehicle is a handlebar-controlled vehicle, such as a bicycle, a two-wheeled motorcycle, a two-wheeled electric bicycle, or a three-wheeled or four-wheeled vehicle with other handlebar controls.
[0065] In step S120, the gesture recognition model trained by the model training method described in Example 2 can be used to obtain the gesture actions performed by the driver, such as sliding the palm left or right, or moving the palm vertically closer to or further from the sensor.
[0066] In some embodiments, the vehicle control method 100 may further include: preprocessing the image to be recognized to convert it into a preset format. For example, the image to be recognized may not be directly applicable as input to the gesture recognition model. Therefore, the image to be recognized can be converted into a preset format, such as converting it to YUV format via a video input module, or converting it to a format suitable for the neural network model, so that it can be applied to the gesture recognition model to achieve fast and accurate detection of gestures.
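The format conversion described above can be illustrated with a minimal sketch. It assumes RGB input with components in the range 0 to 1 and uses the widely published BT.601 luma weights; the source does not say which YUV variant or which video input module is used, so the function names `rgb_to_yuv` and `preprocess` and the sample frame are hypothetical.

```python
def rgb_to_yuv(pixel):
    """Convert one (R, G, B) pixel with components in [0, 1] to (Y, U, V)."""
    r, g, b = pixel
    y = 0.299 * r + 0.587 * g + 0.114 * b  # BT.601 luma weights (assumed variant)
    u = 0.492 * (b - y)                    # scaled blue-difference chroma
    v = 0.877 * (r - y)                    # scaled red-difference chroma
    return (y, u, v)


def preprocess(image):
    """Convert an H x W image of RGB tuples into YUV tuples with the same layout."""
    return [[rgb_to_yuv(px) for px in row] for row in image]


# Hypothetical 2 x 2 frame: white, black, mid-grey, pure red.
frame = [[(1.0, 1.0, 1.0), (0.0, 0.0, 0.0)],
         [(0.5, 0.5, 0.5), (1.0, 0.0, 0.0)]]
yuv_frame = preprocess(frame)
```

Grey-scale pixels map to zero chroma, which is a quick sanity check on any such conversion. On real hardware this step would more likely be done by the video input module itself rather than in Python.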
[0067] In step S130, the driver's hand gesture is compared with a predefined reference gesture. When the gesture comparison results match, the vehicle is controlled according to the operation command corresponding to the reference gesture. For example, if the driver's palm is detected to slide to the left or right, an operation command to play the previous or next song is issued to the vehicle.
[0068] In some situations, controlling a vehicle via gestures may not be as timely or accurate as using the function buttons on the handlebars. In such cases, gestures are suitable for triggering operational commands that the function buttons on the handlebars do not have. For example, the function buttons on the handlebars perform safety operations such as braking, while gestures trigger non-safety operations such as answering a phone call or playing music.
[0069] In some embodiments, non-safety operation commands may be vehicle assistance function operation commands that do not directly interfere with core parameters of vehicle driving safety (such as power output, braking performance, steering control, driving stability, etc.) and are only used to optimize the driving experience.
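Steps S110 to S130, together with the restriction to non-safety commands, can be sketched as follows. This is an illustration under stated assumptions rather than the claimed implementation: the gesture labels, command names, and the mapping of a palm approach to answering a call are hypothetical, and `capture` and `recognize` stand in for the image sensor and the gesture recognition model.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical reference gestures and their operation commands. The source
# only gives left/right palm slides for previous/next song as an example.
REFERENCE_GESTURES: Dict[str, str] = {
    "palm_slide_left": "previous_song",
    "palm_slide_right": "next_song",
    "palm_move_closer": "answer_call",
}

# Only non-safety commands may be triggered by gestures (assumed whitelist).
NON_SAFETY_COMMANDS = {"previous_song", "next_song", "answer_call"}


def match_command(gesture: Optional[str]) -> Optional[str]:
    """Compare a recognized gesture with the reference gestures (step S130)."""
    if gesture is None:
        return None
    command = REFERENCE_GESTURES.get(gesture)
    return command if command in NON_SAFETY_COMMANDS else None


def control_step(capture: Callable[[], object],
                 recognize: Callable[[object], Optional[str]],
                 issued: List[str]) -> Optional[str]:
    """Run one capture -> recognize -> compare -> control cycle."""
    image = capture()                 # step S110: image to be recognized
    gesture = recognize(image)        # step S120: gesture recognition model
    command = match_command(gesture)  # step S130: compare with references
    if command is not None:
        issued.append(command)        # stand-in for sending the command
    return command
```

Keeping the whitelist separate from the gesture table means a newly added gesture cannot reach a safety-relevant function unless its command is explicitly allowed.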
[0070] In some embodiments, the vehicle control method 100 is suitable for end-side deployment, which can solve the problem of insufficient vehicle function buttons and provide drivers with rich and convenient vehicle control methods.
[0071] Example 2
[0072] Figure 2 shows a flowchart of the model training method 200 for the gesture recognition model in this embodiment. As shown in Figure 2, the model training method 200 for the gesture recognition model may include: step S210, acquiring a gesture action detection dataset, which includes multiple image samples of the vehicle while it is stationary or moving; step S220, extracting edge information and background information for each of the multiple image samples; step S230, performing multi-scale forward enhancement processing on each of the multiple image samples; step S240, recognizing one or more gesture actions in each of the multiple image samples; and step S250, training a neural network model based on the edge information, the background information, and the recognized one or more gesture actions to establish a gesture recognition model.
[0073] The aforementioned neural network model can be the 8th generation version of the YOLO series algorithm (YOLOv8) or other models based on deep neural network algorithms, in order to perform detection on each image.
[0074] Edge information locates the outline boundary of the target object, while background information determines the environmental range of the target object. Extracting both edge and background information to train the neural network model therefore allows the target to be located more accurately in complex scenes, reduces false detections and missed detections, and makes it easier to infer the complete shape and position of the occluded part when the target object is partially occluded, thereby improving the accuracy of the established gesture recognition model.
[0075] Because the acquired gesture detection dataset contains a large number of detection images from different angles, times, regions, and weather conditions, and the number of targets of different categories is relatively balanced, the diversity and richness of the detection dataset can be ensured, which is conducive to improving the accuracy of the established gesture recognition model and adapting it to diverse application scenarios.
[0076] In step S220, obtaining the edge information and background information of each image sample among the multiple image samples may include performing a global average pooling operation and a global max pooling operation on each image sample among the multiple image samples to obtain the edge information and background information of the image sample.
[0077] In global average pooling, the entire feature map is taken as input, and a new feature vector is generated by averaging it across its spatial dimensions (height and width). For example, suppose the feature map is of size H×W×C, where H represents the height, W represents the width, and C represents the number of channels. For each channel c (c=1, 2, ..., C), the average of all elements in that channel is calculated. After global average pooling, a feature vector of size 1×1×C is obtained. This feature vector compresses the original feature map in terms of height and width, retaining only the channel dimension information, and the value for each channel is the average of all elements in the corresponding channel of the original feature map.
[0078] By applying global average pooling, the number of connection weights in the fully connected layers of the neural network is reduced. This reduction in parameters helps decrease the complexity of the neural network, decreases the risk of overfitting, and thus enables the neural network model to generalize better on unseen data.
[0079] Similar to global average pooling, global max pooling takes the entire feature map as input and maximizes it in its spatial dimensions (height and width) to generate a new feature vector. For example, suppose the feature map is of size H×W×C, where H represents the height, W represents the width, and C represents the number of channels. For each channel c (c=1, 2, ..., C), the maximum value of all elements in that channel is determined. After global max pooling, a feature vector of size 1×1×C is obtained. This feature vector compresses the original feature map in both height and width, retaining only the channel dimension information, and the value for each channel is the maximum value among all elements of the corresponding channel in the original feature map.
[0080] By applying global max pooling, the most salient features in each channel can be extracted. Since these maximum values may represent key information in the image—for example, in object detection tasks, the most representative features of an object (edges, textures, etc.) may exist in the feature map as maximum values—global max pooling can help the model quickly find this key information, improving the accuracy and efficiency of object detection.
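The two pooling operations described above reduce an H×W×C feature map to a vector with one value per channel. A minimal sketch, assuming the feature map is stored as nested Python lists indexed as `fmap[h][w][c]` (the layout and function names are illustrative only):

```python
def global_avg_pool(fmap):
    """Average each channel over height and width: H x W x C -> C values."""
    h, w, c = len(fmap), len(fmap[0]), len(fmap[0][0])
    return [sum(fmap[i][j][k] for i in range(h) for j in range(w)) / (h * w)
            for k in range(c)]


def global_max_pool(fmap):
    """Take the maximum of each channel over height and width: H x W x C -> C values."""
    h, w, c = len(fmap), len(fmap[0]), len(fmap[0][0])
    return [max(fmap[i][j][k] for i in range(h) for j in range(w))
            for k in range(c)]


# A 2 x 2 feature map with 2 channels.
fmap = [[[1, 10], [2, 20]],
        [[3, 30], [4, 40]]]
avg_vec = global_avg_pool(fmap)  # one average per channel
max_vec = global_max_pool(fmap)  # one maximum per channel
```

The average responds to the overall level of a channel, while the maximum keeps only its strongest response, which is why the two are used together.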
[0081] Figure 3 illustrates a flowchart of performing global average pooling and global max pooling operations on an image according to an exemplary embodiment. As shown in Figure 3, by adding operations 310 (Max Pooling 2-dimensional (MaxPool2d)) and 320 (Average Pooling 2-dimensional (AvgPool2d)), edge and background information of image samples can be extracted more efficiently, thereby improving the expressive power of the neural network model.
[0082] In step S230, the multi-scale forward enhancement (MSFE) process can make the neural network model suitable for small target object detection, thereby improving the accuracy of gesture recognition.
[0083] As shown in Figure 4, the MSFE process includes performing dilated convolutions with multiple dilation rates on the original feature map before it is input to the detection module (Detect) to obtain multiple dilated feature maps; fusing the original feature map with each dilated feature map to obtain an enhanced feature map; and replacing the original feature map with the enhanced feature map and inputting it into Detect.
[0084] In some implementations, dilated convolution can be performed by adding spaces (zeros) between kernel elements to enlarge the kernel before operation. The dilation rate is used to define the data spacing during dilated convolution. For example, in this embodiment, the dilation rate can be selected as 2 or 3, as shown in Figure 4. The original feature map (which can be understood as having a dilation rate of 1), the dilated feature map with a dilation rate of 2, and the dilated feature map with a dilation rate of 3 are concatenated (Concat), and then convolved (Conv) before being input into Detect. MSFE processing is added before Detect1, Detect2, and Detect3.
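The dilated-convolution branches of the MSFE process can be sketched on a single-channel feature map. The sketch assumes a square, odd-sized kernel, zero padding so that the output keeps the input size, and the same kernel for every branch; none of these details are given in the source, and the final Conv applied after Concat is omitted.

```python
def dilated_conv2d(fmap, kernel, rate):
    """Same-size 2D dilated convolution (cross-correlation) with zero padding."""
    h, w = len(fmap), len(fmap[0])
    k = len(kernel)  # square, odd-sized kernel assumed
    half = k // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for a in range(k):
                for b in range(k):
                    # Kernel taps are spaced `rate` pixels apart.
                    y = i + (a - half) * rate
                    x = j + (b - half) * rate
                    if 0 <= y < h and 0 <= x < w:
                        acc += kernel[a][b] * fmap[y][x]
            out[i][j] = acc
    return out


def effective_kernel_size(k, rate):
    """Span covered by a k x k kernel once gaps are inserted between its taps."""
    return k + (k - 1) * (rate - 1)


def msfe_concat(fmap, kernel, rates=(2, 3)):
    """Stack the original map (rate 1) with its dilated versions as channels."""
    branches = [fmap] + [dilated_conv2d(fmap, kernel, r) for r in rates]
    h, w = len(fmap), len(fmap[0])
    return [[[br[i][j] for br in branches] for j in range(w)] for i in range(h)]


fmap = [[float(i * 5 + j) for j in range(5)] for i in range(5)]
box = [[1.0] * 3 for _ in range(3)]  # hypothetical 3 x 3 kernel of ones
stacked = msfe_concat(fmap, box)     # 5 x 5 x 3 enhanced feature map
```

With a 3×3 kernel, dilation rates of 2 and 3 cover 5×5 and 7×7 regions without adding weights, so the stacked map carries context at several scales into Detect.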
[0085] In some embodiments, the model training method 200 for the gesture recognition model may further include step S260: determining a loss function between the predicted bounding box and the labeled bounding box of the recognized gesture action. Step S250: training the neural network model based on edge information, background information, and the recognized gesture action to establish the gesture recognition model may include: training the neural network model based on edge information, background information, and the loss function to establish the gesture recognition model.
[0086] The loss function can characterize the difference between the predicted bounding box and the labeled bounding box of a gesture action. By calculating this difference and training a neural network model based on it, the accuracy of the model can be improved.
[0087] According to some embodiments of this disclosure, the loss function can be the Inner Complete Intersection over Union (Inner CIoU) loss function, which measures the difference between the predicted bounding box of a detected object and its labeled (ground-truth) bounding box. In the expression for Inner CIoU, Intersection over Union (IoU) is the ratio of the intersection to the union of the predicted bounding box and the labeled (ground-truth) bounding box, and α is the weighting factor that balances the different loss terms (such as the distance between center points and the difference in aspect ratio). For example, α can take the value 0.5.
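The IoU term defined above can be computed directly from box coordinates. The sketch below covers only that term, not the full Inner CIoU expression, and assumes axis-aligned boxes given as `(x1, y1, x2, y2)` with `x1 < x2` and `y1 < y2`; the box format and function name are illustrative.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap; zero when the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


predicted = (0.0, 0.0, 2.0, 2.0)
labeled = (1.0, 1.0, 3.0, 3.0)
overlap = iou(predicted, labeled)  # intersection 1, union 7
```

An IoU of 1 means the predicted box coincides with the labeled box, and 0 means they do not overlap, so a loss built on it falls as the prediction improves.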
[0088] In some implementations, the loss function can also be other suitable loss functions.
[0089] In some implementations, the model training method 200 for the gesture recognition model may further include: determining the Rectified Linear Unit (ReLU) function as the activation function of the neural network model; and wherein training the neural network model based on edge information, background information, and the recognized gesture action to establish the gesture recognition model may include: training the neural network model based on edge information, background information, and the activation function to establish the gesture recognition model.
[0090] In some implementations, the model training method 200 for the gesture recognition model may further include performing a redundancy removal operation on the neural network model.
[0091] When the ReLU activation function is used to train a neural network model, some neural network operation units output 0, which produces a degree of sparsity. Furthermore, in convolutional neural network models, the number of channels in the convolution kernels is usually redundant. Since a regular convolutional layer is typically followed by a batch normalization layer, the batch normalization layer can be used to remove redundant channels. This reduces the number of computations and the complexity of the neural network model, speeds up computation, and lets the model focus on the key features of the data, thereby reducing the fitting of irrelevant information.
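One common way to use a batch normalization layer for redundancy removal is to drop the channels whose learned scale factors are close to zero. The source does not state its selection criterion, so the threshold rule, the function names, and the sample values below are assumptions made for illustration.

```python
def relu(x):
    """Rectified linear unit: negative inputs become 0."""
    return max(0.0, x)


def sparsity(pre_activations):
    """Fraction of units whose ReLU output is exactly 0."""
    outputs = [relu(v) for v in pre_activations]
    return sum(1 for o in outputs if o == 0.0) / len(outputs)


def select_channels(bn_scales, threshold):
    """Indices of channels whose batch-norm scale magnitude reaches the threshold."""
    return [i for i, g in enumerate(bn_scales) if abs(g) >= threshold]


def prune_conv(weights, keep):
    """Keep only the output-channel filters listed in `keep`."""
    return [weights[i] for i in keep]


# Hypothetical layer with four output channels.
bn_scales = [0.9, 0.001, -0.5, 0.0]
conv_weights = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
keep = select_channels(bn_scales, threshold=0.01)
pruned = prune_conv(conv_weights, keep)
```

A channel with a near-zero scale contributes almost nothing after normalization, so removing its filter shrinks the layer with little effect on the output. In practice the next layer's input channels must be pruned to match.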
[0092] After establishing the gesture recognition model, it can be trained and fitted, and finally converted into an offline model in Open Neural Network Exchange (ONNX) format to decouple it from the original deep learning framework. This allows it to run on various operating systems and hardware architectures without relying on the training framework. After obtaining the corresponding ONNX offline model, the Distribution Focal Loss (DFL) module can be pruned to avoid operations such as reshaping, transposing, and using the Softmax function, thereby simplifying the operation and improving the model's computation speed.
[0093] In some implementations, the model training method 200 for the gesture recognition model may further include performing graph optimization operations on the convolutional layers, batch normalization layers, and activation functions of the gesture recognition model.
[0094] Performing graph optimization operations may include reparameterizing the gesture recognition model to couple multiple operators of the corresponding convolutional layers, batch normalization layers, and activation functions of the gesture recognition model.
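A widely used form of this reparameterization folds the batch normalization statistics into the preceding convolution so that the two operators become one. The sketch below shows this for a 1×1 convolution acting on a single pixel; it is an assumption about what coupling the operators involves, and the parameter values are hypothetical.

```python
import math


def conv1x1(x, weight, bias):
    """1 x 1 convolution at one pixel: x has C_in values, weight is C_out x C_in."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def batch_norm(y, gamma, beta, mean, var, eps=1e-5):
    """Inference-time batch normalization applied per channel."""
    return [g * (v - m) / math.sqrt(s + eps) + bt
            for v, g, bt, m, s in zip(y, gamma, beta, mean, var)]


def fuse_conv_bn(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold batch-norm parameters into the convolution weights and bias."""
    fused_w, fused_b = [], []
    for row, b, g, bt, m, s in zip(weight, bias, gamma, beta, mean, var):
        scale = g / math.sqrt(s + eps)
        fused_w.append([w * scale for w in row])
        fused_b.append((b - m) * scale + bt)
    return fused_w, fused_b


def relu(values):
    return [max(0.0, v) for v in values]


# Hypothetical layer parameters and input.
weight, bias = [[1.0, -2.0], [0.5, 0.25]], [0.1, -0.3]
gamma, beta = [1.5, 0.8], [0.0, 0.2]
mean, var = [0.4, -0.1], [2.0, 0.5]
x = [3.0, -1.0]

reference = relu(batch_norm(conv1x1(x, weight, bias), gamma, beta, mean, var))
fused_w, fused_b = fuse_conv_bn(weight, bias, gamma, beta, mean, var)
fused = relu(conv1x1(x, fused_w, fused_b))
```

The fused layer produces the same outputs as the original three-operator chain up to floating-point rounding while removing one pass over the feature map, which matters most on end-side hardware.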
[0095] Example 3
[0096] Figure 5 shows a structural block diagram of the vehicle control device 500 of this embodiment. As shown in Figure 5, the vehicle control device 500 may include: an image capture unit 510, configured to capture images of the surrounding area of the vehicle to be recognized via an image sensor on the vehicle; a gesture recognition unit 520, configured to recognize the images to be recognized using a pre-trained gesture recognition model to obtain the gesture actions performed by the driver; and a control unit 530, configured to compare the gesture actions performed by the driver with predefined reference gestures, and when the gesture comparison results match, control the vehicle according to the operation instructions corresponding to the reference gestures.
[0097] In some embodiments, the vehicle control device 500 may further include a preprocessing unit configured to preprocess the image to be recognized so as to convert it into a preset format.
[0098] The various units of the vehicle control device 500 shown in Figure 5 correspond to the various steps in the vehicle control method 100 described with reference to Figure 1. Therefore, the operations, features, and advantages described for the vehicle control method 100 also apply to the vehicle control device 500 and its constituent units. For the sake of brevity, some operations, features, and advantages will not be repeated here.
[0099] Example 4
[0100] Figure 6 shows a schematic diagram of the structure of an electronic device that can be used to implement embodiments of this application. The components shown herein, their connections and relationships, and their functions are merely examples and are not intended to limit the implementation of this application described and / or claimed herein.
[0101] As shown in Figure 6, the electronic device 10 includes at least one processor 11 and a memory, such as a read-only memory (ROM) 12 or a random access memory (RAM) 13, communicatively connected to the at least one processor 11. The memory stores computer programs executable by the at least one processor. The processor 11 can perform various appropriate actions and processes based on the computer program stored in the ROM 12 or loaded from storage unit 18 into the RAM 13. The RAM 13 can also store various programs and data required for the operation of the electronic device 10. The processor 11, ROM 12, and RAM 13 are interconnected via a bus 14. An input / output (I / O) interface 15 is also connected to the bus 14.
[0102] Multiple components in electronic device 10 are connected to I / O interface 15, including: input unit 16, such as keyboard, mouse, etc.; output unit 17, such as various types of displays, speakers, etc.; storage unit 18, such as disk, optical disk, etc.; and communication unit 19, such as network card, modem, wireless transceiver, etc. Communication unit 19 allows electronic device 10 to exchange information / data with other devices through computer networks such as the Internet and / or various telecommunications networks.
[0103] Processor 11 can be a variety of general-purpose and / or special-purpose processing components with processing and computing capabilities. Some examples of processor 11 include a central processing unit (CPU), a graphics processing unit (GPU), various special-purpose artificial intelligence (AI) computing chips, various processors running machine learning model algorithms, digital signal processors (DSPs), and any suitable processor, controller, microcontroller, etc. Processor 11 performs the various methods described above, such as vehicle control methods and model training methods for gesture recognition models.
[0104] In some embodiments, the vehicle control method and the gesture recognition model training method described above can be implemented as computer programs, which are tangibly contained in a computer-readable storage medium, such as storage unit 18. In some embodiments, part or all of the computer program can be loaded and / or installed on the electronic device 10 via ROM 12 and / or communication unit 19. When the computer program is loaded into RAM 13 and executed by processor 11, the vehicle control method and the gesture recognition model training method described above can be executed. Alternatively, in other embodiments, processor 11 can be configured to execute the vehicle control method and the gesture recognition model training method by any other suitable means (e.g., by means of firmware).
[0105] Various embodiments of the systems and techniques described above herein can be implemented in digital electronic circuit systems, integrated circuit systems, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard parts (ASSPs), system-on-chips (SoCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and / or combinations thereof. These various embodiments may include implementations in one or more computer programs that can be executed and / or interpreted on a programmable system including at least one programmable processor, which may be a dedicated or general-purpose programmable processor, capable of receiving data and instructions from a storage system, at least one input device, and at least one output device, and transmitting data and instructions to the storage system, the at least one input device, and the at least one output device.
[0106] Computer programs used to implement the methods of this application may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing device, such that when executed by the processor, the computer programs cause the functions / operations specified in the flowcharts and / or block diagrams to be performed. The computer programs may be executed entirely on a machine, partially on a machine, or as a standalone software package, partially on a machine and partially on a remote machine, or entirely on a remote machine or server.
[0107] In the context of this application, a computer-readable storage medium can be a tangible medium that may contain or store a computer program for use by or in conjunction with an instruction execution system, apparatus, or device. A computer-readable storage medium may include electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing. Alternatively, a computer-readable storage medium may be a machine-readable signal medium. Examples of machine-readable storage media may include electrical connections based on one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), flash memory, optical fiber, compact disc-read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.
[0108] To provide interaction with a user, the systems and techniques described herein can be implemented on an electronic device having: a display device for displaying information to a user; and a keyboard and pointing device through which the user provides input to the electronic device. Other types of devices can also be used to provide interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form (including acoustic input, speech input, or tactile input).
[0109] The systems and technologies described herein can be implemented in computing systems that include backend components (e.g., as data servers), or middleware components (e.g., application servers), or frontend components (e.g., user computers with graphical user interfaces or web browsers through which users can interact with implementations of the systems and technologies described herein), or any combination of such backend, middleware, or frontend components. The components of the system can be interconnected via digital data communication of any form or medium (e.g., communication networks). Examples of communication networks include local area networks (LANs), wide area networks (WANs), blockchain networks, and the Internet.
[0110] A computing system can include clients and servers. Clients and servers are generally remote from each other and typically interact through a communication network. The client-server relationship arises from computer programs that run on the respective computers and have a client-server relationship with each other. The server can be a cloud server, also known as a cloud computing server or cloud host, which is a hosting product within the cloud computing service system. It addresses the shortcomings of traditional physical hosts and Virtual Private Server (VPS) services, such as high management difficulty and weak business scalability.
[0111] It should be understood that steps may be reordered, added, or deleted using the various forms of processes shown above. For example, the multiple steps described in this application can be executed in parallel, sequentially, or in a different order, as long as the desired result of the technical solution of this application can be achieved.
Claims
1. A vehicle control method, comprising: capturing, by an image sensor on a vehicle, an image to be recognized of an area around the vehicle; inputting the image to be recognized into a pre-trained gesture recognition model to obtain a gesture action performed by a driver; and comparing the gesture action performed by the driver with a predefined reference gesture action and, when the two match, controlling the vehicle according to an operation instruction corresponding to the reference gesture action.
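The recognize, compare, and control steps of claim 1 can be illustrated with a minimal sketch. It is not the applicant's implementation: the gesture names, the instruction names, and the `recognize` and `execute` callables are all hypothetical stand-ins for the pre-trained model and the vehicle interface.

```python
from typing import Callable, Dict, Optional

# Hypothetical reference gestures and the operation instructions bound to them.
REFERENCE_GESTURES: Dict[str, str] = {
    "thumbs_up": "toggle_headlight",
    "open_palm": "answer_call",
}


def control_vehicle(
    image: object,
    recognize: Callable[[object], Optional[str]],
    execute: Callable[[str], None],
    reference_gestures: Dict[str, str] = REFERENCE_GESTURES,
) -> Optional[str]:
    """Recognize a gesture in `image` and run the matching instruction, if any."""
    gesture = recognize(image)  # stand-in for the pre-trained recognition model
    if gesture is None:
        return None
    instruction = reference_gestures.get(gesture)  # compare with reference gestures
    if instruction is not None:
        execute(instruction)  # only a matched gesture controls the vehicle
    return instruction
```

Keeping the gesture-to-instruction mapping in a plain dictionary means an unmatched gesture has no effect, which is the behavior the claim describes when the comparison fails.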
2. The vehicle control method according to claim 1, further comprising: preprocessing the image to be recognized to convert it into an image to be recognized in a preset format.
3. The vehicle control method according to claim 1, wherein the vehicle is a vehicle controlled by a steering wheel.
4. The vehicle control method according to claim 1, wherein the image sensor is installed on a dashboard of the vehicle or in an area of the vehicle at a preset distance from the dashboard.
5. The vehicle control method according to claim 1, wherein the operation instruction is a non-safety operation instruction.
6. The vehicle control method according to claim 1, wherein the vehicle is a two-wheeled vehicle.
7. The vehicle control method according to claim 1, wherein the gesture recognition model is a neural network model trained by: obtaining a gesture action detection dataset that includes a plurality of image samples of the vehicle while it is stopped or moving; for each of the plurality of image samples: extracting edge information and background information of the image sample; performing multi-scale forward augmentation on the image sample; and recognizing one or more gesture actions in the image sample; and training a neural network model based on the edge information, the background information, and the recognized one or more gesture actions to establish the gesture recognition model.
8. The vehicle control method according to claim 7, wherein extracting, for each of the plurality of image samples, the edge information and background information of the image sample comprises: performing global average pooling and global max pooling operations on each of the plurality of image samples to obtain the edge information and background information of the image sample.
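The two pooling operations named in claim 8 can be sketched as follows. The channels x height x width layout and the function name `global_pool` are assumptions made for illustration. The claim does not state which operation yields which kind of information; one common reading, not confirmed by the application, is that average pooling summarizes smooth background context while max pooling keeps the strongest responses, such as edges.

```python
from typing import List, Tuple

FeatureMap = List[List[List[float]]]  # assumed layout: channels x height x width


def global_pool(feature_map: FeatureMap) -> Tuple[List[float], List[float]]:
    """Return (per-channel mean, per-channel max) over all spatial positions."""
    averages: List[float] = []
    maxima: List[float] = []
    for channel in feature_map:
        values = [v for row in channel for v in row]  # flatten height x width
        averages.append(sum(values) / len(values))    # global average pooling
        maxima.append(max(values))                    # global max pooling
    return averages, maxima
```

Each channel collapses to a single number per operation, so the output size depends only on the channel count, not on the image resolution.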
9. The vehicle control method according to claim 7, further comprising: for the recognized one or more gesture actions, determining a loss function between a predicted bounding box and a labeled bounding box of the one or more gesture actions; wherein training the neural network model based on the edge information, the background information, and the recognized one or more gesture actions to establish the gesture recognition model comprises: training the neural network model based on the edge information, the background information, and the loss function to establish the gesture recognition model.
10. The vehicle control method according to claim 9, wherein the loss function is the Inner Complete Intersection over Union (Inner-CIoU) loss function.
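The claims do not define the Inner-CIoU loss. The sketch below assumes the commonly published formulation, in which the CIoU loss is adjusted by the difference between the ordinary IoU and an IoU computed on auxiliary boxes scaled about their centers by a ratio. The `(center_x, center_y, width, height)` box format, the function names, and the default `ratio=0.75` are illustrative assumptions, and boxes are assumed to have positive width and height.

```python
import math
from typing import Tuple

Box = Tuple[float, float, float, float]  # assumed: (center_x, center_y, width, height)


def iou(a: Box, b: Box, scale: float = 1.0) -> float:
    """IoU of two boxes after scaling each one about its own center."""
    aw, ah, bw, bh = a[2] * scale, a[3] * scale, b[2] * scale, b[3] * scale
    iw = min(a[0] + aw / 2, b[0] + bw / 2) - max(a[0] - aw / 2, b[0] - bw / 2)
    ih = min(a[1] + ah / 2, b[1] + bh / 2) - max(a[1] - ah / 2, b[1] - bh / 2)
    inter = max(iw, 0.0) * max(ih, 0.0)
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def inner_ciou_loss(pred: Box, target: Box, ratio: float = 0.75) -> float:
    """Assumed form: L_CIoU + IoU - IoU_inner, for boxes with positive size."""
    plain_iou = iou(pred, target)
    # Squared diagonal of the smallest box enclosing both inputs.
    cw = max(pred[0] + pred[2] / 2, target[0] + target[2] / 2) - min(
        pred[0] - pred[2] / 2, target[0] - target[2] / 2)
    ch = max(pred[1] + pred[3] / 2, target[1] + target[3] / 2) - min(
        pred[1] - pred[3] / 2, target[1] - target[3] / 2)
    diag2 = cw ** 2 + ch ** 2
    # Squared distance between the two box centers.
    center2 = (pred[0] - target[0]) ** 2 + (pred[1] - target[1]) ** 2
    # Aspect-ratio consistency term and its trade-off weight.
    v = (4 / math.pi ** 2) * (
        math.atan(target[2] / target[3]) - math.atan(pred[2] / pred[3])) ** 2
    alpha = v / (1 - plain_iou + v) if v > 0 else 0.0
    ciou_loss = 1 - plain_iou + center2 / diag2 + alpha * v
    # Inner adjustment: swap the plain IoU for the IoU of the scaled boxes.
    return ciou_loss + plain_iou - iou(pred, target, ratio)
```

With `ratio` below 1 the auxiliary boxes are smaller than the originals, so a partial overlap scores a lower inner IoU and the loss stays larger until the boxes align closely.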
11. The vehicle control method according to claim 7, further comprising: determining the rectified linear unit (ReLU) function as the activation function of the neural network model; wherein training the neural network model based on the edge information, the background information, and the recognized one or more gesture actions to establish the gesture recognition model comprises: training the neural network model based on the edge information, the background information, and the activation function to establish the gesture recognition model.
12. The vehicle control method according to claim 7, wherein training the neural network model based on the edge information, the background information, and the recognized one or more gesture actions to establish the gesture recognition model comprises: performing a redundancy removal operation on the neural network model.
13. The vehicle control method according to claim 7, further comprising: after establishing the gesture recognition model, performing a graph optimization operation on the convolutional layers, batch normalization layers, and activation functions of the gesture recognition model.
14. The vehicle control method according to claim 13, wherein performing the graph optimization operation on the convolutional layers, batch normalization layers, and activation functions of the gesture recognition model comprises: reparameterizing the gesture recognition model to couple multiple operators of the corresponding convolutional layers, batch normalization layers, and activation functions of the gesture recognition model.
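The application does not spell out the arithmetic of the reparameterization in claim 14. One widely used way to couple a convolution with the batch normalization that follows it is to fold the normalization statistics into the convolution weights, as in the sketch below. For brevity it assumes a 1x1 convolution evaluated at a single pixel, so each output channel is a weighted sum of the input channels; all function names are hypothetical.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]
Vector = List[float]


def conv1x1(weight: Matrix, bias: Vector, x: Vector) -> Vector:
    """1x1 convolution at one pixel: one weighted sum per output channel."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def batch_norm(y: Vector, gamma: Vector, beta: Vector,
               mean: Vector, var: Vector, eps: float = 1e-5) -> Vector:
    """Inference-time batch normalization with fixed running statistics."""
    return [g * (yi - m) / math.sqrt(v + eps) + bt
            for yi, g, bt, m, v in zip(y, gamma, beta, mean, var)]


def relu(y: Vector) -> Vector:
    return [max(v, 0.0) for v in y]


def fuse_conv_bn(weight: Matrix, bias: Vector, gamma: Vector, beta: Vector,
                 mean: Vector, var: Vector, eps: float = 1e-5) -> Tuple[Matrix, Vector]:
    """Fold batch normalization into the convolution so one operator replaces two."""
    fused_w: Matrix = []
    fused_b: Vector = []
    for row, b, g, bt, m, v in zip(weight, bias, gamma, beta, mean, var):
        scale = g / math.sqrt(v + eps)          # per-channel rescaling factor
        fused_w.append([w * scale for w in row])
        fused_b.append((b - m) * scale + bt)
    return fused_w, fused_b
```

After fusion, `relu(conv1x1(fused_w, fused_b, x))` gives the same result as `relu(batch_norm(conv1x1(weight, bias, x), ...))` up to floating-point rounding, while running one fewer operator per layer at inference time.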
15. A method for training a gesture recognition model, comprising: obtaining a gesture action detection dataset that includes a plurality of image samples of a vehicle while it is stopped or moving; for each of the plurality of image samples: extracting edge information and background information of the image sample; performing multi-scale forward augmentation on the image sample; and recognizing one or more gesture actions in the image sample; and training a neural network model based on the edge information, the background information, and the recognized one or more gesture actions to establish a gesture recognition model.
16. A vehicle control device, comprising: an image capture unit configured to capture, via an image sensor on a vehicle, an image to be recognized of an area around the vehicle; a gesture recognition unit configured to recognize the image to be recognized using a pre-trained gesture recognition model to obtain a gesture action performed by a driver; and a control unit configured to compare the gesture action performed by the driver with a predefined reference gesture action and, when the two match, control the vehicle according to an operation instruction corresponding to the reference gesture action.
17. An electronic device, comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores a computer program executable by the at least one processor, and the computer program, when executed by the at least one processor, enables the at least one processor to perform the vehicle control method of any one of claims 1-14 or the method for training a gesture recognition model of claim 15.
18. A computer-readable storage medium storing computer instructions that, when executed by a processor, cause the processor to implement the vehicle control method of any one of claims 1-14 or the method for training a gesture recognition model of claim 15.
19. A computer program product comprising a computer program that, when executed by a processor, implements the vehicle control method of any one of claims 1-14 or the method for training a gesture recognition model of claim 15.