Citrus fruit picking positioning method, device, equipment and medium
By using NeRF and an improved RandLA-Net network for citrus fruit harvesting and localization, the problems of light variation and noise influence in existing technologies were solved, enabling accurate three-dimensional reconstruction of fruit trees and automated harvesting of mature fruits.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SOUTH CHINA AGRICULTURAL UNIVERSITY
- Filing Date
- 2024-06-19
- Publication Date
- 2026-06-19
AI Technical Summary
Existing image-based modeling methods are easily affected by factors such as changes in lighting and occlusion, making it difficult to accurately recover the complex structure of fruit trees. Point cloud-based modeling methods suffer from noise and sparsity issues, affecting modeling accuracy.
A Neural Radiance Field (NeRF) model combined with the COLMAP algorithm is used for 3D reconstruction, and an improved RandLA-Net network is used for semantic segmentation. By introducing a bilateral enhancement module, accurate 3D reconstruction and semantic segmentation of fruit trees and fruits are achieved.
It enables precise 3D reconstruction of fruit trees, accurately identifies the location of ripe fruits, provides accurate 3D data for automated harvesting and fruit tree pruning, and improves modeling accuracy and robustness.
Figure CN118628913B_ABST
Description
Technical Field
[0001] This application relates to the field of three-dimensional reconstruction, and in particular to a method for locating and harvesting citrus fruits, a corresponding device, electronic equipment, and a computer-readable storage medium.
Background Technology
[0002] 3D reconstruction of fruit trees is a crucial foundation for fruit spatial localization. By modeling the 3D structure of fruit trees, rich spatial information can be obtained, providing a reference for fruit positioning. Traditional 3D reconstruction methods for fruit trees mainly fall into two categories: image-based modeling and point cloud-based modeling.
[0003] Image-based modeling methods reconstruct a three-dimensional model of a fruit tree by taking multi-view images of the tree and applying techniques such as structure from motion (SfM) and multi-view stereo (MVS). Dong et al. [1] proposed a three-dimensional reconstruction and information extraction method for orchards based on SfM and MVS, which realized three-dimensional visualization and parameter estimation of fruit trees. However, image-based modeling methods are easily affected by factors such as changes in lighting and occlusion, and it is difficult for them to accurately restore the complex structure of fruit trees.
[0004] Point cloud-based modeling methods use depth sensors such as LiDAR to directly acquire 3D point cloud data of fruit trees, and generate 3D models of the fruit trees through steps such as point cloud registration, fusion, and surface reconstruction. Gené-Mola et al. used vehicle-mounted LiDAR to scan orchards to acquire high-precision point cloud data, realizing 3D reconstruction of fruit trees and estimation of agronomic parameters. However, point cloud-based modeling methods are subject to problems such as noise and sparsity of the point cloud, which may affect the accuracy of the modeling, and a LiDAR system is generally expensive.
[0005] In summary, existing image-based modeling methods are easily affected by factors such as changes in lighting and occlusion, making it difficult to accurately recover the complex structure of fruit trees. Furthermore, point cloud-based modeling methods suffer from problems such as noise and sparsity in the point cloud, which may affect the accuracy of the modeling. To address these issues, the applicant has made corresponding explorations.
Summary of the Invention
[0006] The purpose of this application is to solve the above-mentioned problems by providing a method for citrus fruit harvesting and positioning, a corresponding device, electronic equipment, and a computer-readable storage medium.
[0007] To achieve the various objectives of this application, the following technical solution is adopted:
[0008] A method for locating and harvesting citrus fruits, proposed to meet one of the purposes of this application, includes:
[0009] In response to the citrus fruit picking and positioning command, images of citrus trees from various angles are acquired, and a preset COLMAP algorithm is called to extract and match feature points from the citrus tree images from various angles to generate camera pose data corresponding to the citrus tree images.
[0010] Using a neural radiance field model trained to convergence, implicit 3D representations of citrus trees are learned from images of citrus trees at various angles based on the camera pose data, so as to generate 3D point cloud data corresponding to the citrus trees.
[0011] Based on a semantic segmentation model trained to convergence, end-to-end semantic segmentation is performed on the 3D point cloud data corresponding to the citrus trees to extract the 3D point cloud data and 3D spatial coordinates corresponding to the mature citrus fruits. The basic network architecture of the semantic segmentation model is an improved RandLA-Net network, which consists of an input layer, an output layer, 5 encoder layers, 5 decoder layers, a bilateral enhancement module, and a bottleneck layer. The bilateral enhancement module is located between the encoder layer and the bottleneck layer.
[0012] The harvesting robot determines the harvesting location of the mature citrus fruit based on the corresponding three-dimensional point cloud data and three-dimensional spatial coordinates, and drives its robotic arm to harvest the mature citrus fruit according to the harvesting location, thereby completing the harvesting and positioning of the mature citrus fruit.
[0013] Optionally, the step of using a neural radiance field model trained to convergence to learn an implicit 3D representation of citrus trees from images of citrus trees at various angles based on the camera pose data to generate corresponding 3D point cloud data for the citrus trees includes:
[0014] The images of citrus trees from various angles, along with their corresponding camera pose data, are input into a neural radiance field model that has been trained to convergence. The scene is then densely sampled, and a multilayer perceptron is used to predict the volume density and view-dependent emitted radiance at each location to generate high-quality images.
[0015] Optionally, the step of performing end-to-end semantic segmentation on the 3D point cloud data corresponding to the citrus trees based on a semantic segmentation model trained to convergence, in order to extract the 3D point cloud data and 3D spatial coordinates corresponding to the ripe citrus fruits, includes:
[0016] Each encoder layer contains an extended (dilated) residual block, which processes the input features and downsamples them through a shared multilayer perceptron, a local spatial encoding module, and an attention pooling module.
[0017] Optionally, the step of performing end-to-end semantic segmentation on the 3D point cloud data corresponding to the citrus trees based on a semantic segmentation model trained to convergence, in order to extract the 3D point cloud data and 3D spatial coordinates corresponding to the ripe citrus fruits, includes:
[0018] In the extended residual block, the input features $(N, d_{in})$ are obtained, where $N$ is the number of points and $d_{in}$ is the feature dimension of each point. A shared multilayer perceptron transforms the input point features to dimension $(N, d_{in}/2)$;
[0019] Local spatial encoding: the features $(N, d_{in}/2)$ processed by the shared multilayer perceptron are combined with the three-dimensional coordinates $(N, 3)$ of each point to generate features of shape $(N, d_{in})$;
[0020] Attention pooling: for the features $(N, d_{in})$ output by the local spatial encoding module, the attention pooling module first calculates the attention scores, then weights the features by the calculated attention scores to generate new features $(N, d_{in})$;
[0021] Shortcut path: the input features $(N, d_{in})$ are transformed to $(N, d_{in}/2)$ through a shared multilayer perceptron;
[0022] Residual connection and activation function: the features of the main path and the shortcut path are added together, and then the LeakyReLU activation function is applied to output the final features $(N, 2d_{in})$.
[0023] Optionally, the step of performing end-to-end semantic segmentation on the 3D point cloud data corresponding to the citrus trees based on a semantic segmentation model trained to convergence, in order to extract the 3D point cloud data and 3D spatial coordinates corresponding to the ripe citrus fruits, includes:
[0024] In the bilateral enhancement module, the relative position encoding module calculates the relative position and distance between a point and its neighboring points, and uses the relative position and distance between the point and its neighboring points as features. The calculation formula includes:
[0025] $\mathrm{relative}_{xyz} = \mathrm{xyz}_{tile} - \mathrm{neighbor}_{xyz}$,
[0026] $\mathrm{relative}_{dis} = \sqrt{\sum (\mathrm{relative}_{xyz})^{2}}$,
[0027] $\mathrm{relative}_{feature} = \mathrm{concat}(\mathrm{relative}_{dis}, \mathrm{relative}_{xyz}, \mathrm{xyz}_{tile}, \mathrm{neighbor}_{xyz})$,
[0028] where $\mathrm{relative}_{xyz}$ is the relative position of each point with respect to its neighboring points; $\mathrm{neighbor}_{xyz}$ is the coordinates of the neighboring points; $\mathrm{xyz}_{tile}$ is a matrix with the same shape as $\mathrm{neighbor}_{xyz}$, obtained by copying $\mathrm{xyz}$, where $\mathrm{xyz}$ is the three-dimensional coordinates of a point; $\mathrm{relative}_{dis}$ is the Euclidean distance between each point and its neighbors (the sum runs over the three coordinate axes); and $\mathrm{relative}_{feature}$ is the concatenation of the relative distance, relative position, original coordinates, and coordinates of the neighboring points;
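The relative position encoding above can be sketched in NumPy as follows. This is only an illustrative sketch, not the applicant's implementation: the function name `relative_position_encoding`, the neighbor index array `neigh_idx` (for example from a k-nearest-neighbor search), and the toy coordinates are assumptions, and the distance is computed as the Euclidean norm of the relative position.

```python
import numpy as np

def relative_position_encoding(xyz, neigh_idx):
    """xyz: (N, 3) point coordinates; neigh_idx: (N, K) indices of each point's neighbors."""
    k = neigh_idx.shape[1]
    neighbor_xyz = xyz[neigh_idx]                       # (N, K, 3) neighbor coordinates
    xyz_tile = np.repeat(xyz[:, None, :], k, axis=1)    # (N, K, 3) copies of each point
    relative_xyz = xyz_tile - neighbor_xyz              # (N, K, 3) relative position
    # Euclidean distance between each point and each of its neighbors
    relative_dis = np.sqrt(np.sum(relative_xyz ** 2, axis=-1, keepdims=True))  # (N, K, 1)
    # Concatenate distance, relative position, own coordinates, neighbor coordinates
    return np.concatenate([relative_dis, relative_xyz, xyz_tile, neighbor_xyz], axis=-1)

# Toy example: 3 points, each with the other 2 points as neighbors (assumed data)
xyz = np.array([[0.0, 0.0, 0.0], [3.0, 4.0, 0.0], [0.0, 0.0, 1.0]])
neigh_idx = np.array([[1, 2], [0, 2], [0, 1]])
relative_feature = relative_position_encoding(xyz, neigh_idx)
```

Each point ends up with a 10-dimensional positional feature per neighbor (1 distance value plus three 3-dimensional vectors), which is the input the later aggregation steps consume.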
[0029] In the local feature aggregation module of the bilateral enhancement module, feature fusion and enhancement are achieved by combining relative position features and neighborhood point features. The calculation formula includes:
[0030] $f_{concat1} = \mathrm{concat}(f_{neighbours}, f_{xyz1})$,
[0031] $f_{pcagg1} = \mathrm{attention\_pooling}(f_{concat1})$,
[0032] $f_{concat2} = \mathrm{concat}(f_{neighbours}, f_{xyz2})$,
[0033] $f_{pcagg2} = \mathrm{attention\_pooling}(f_{concat2})$,
[0034] where $f_{concat1}$ is the concatenation of the neighborhood point features $f_{neighbours}$ and the relative position features $f_{xyz1}$, and $f_{pcagg1}$ is the feature obtained by weighted aggregation of $f_{concat1}$ through attention pooling;
[0035] In the attention pooling module of the bilateral enhancement module, the attention weight of each point to its neighboring points is calculated, and the features of the neighboring points are weighted and averaged to further enhance the feature representation. The calculation formula includes:
[0036] $\mathrm{att}_{activation} = \mathrm{dense}(f_{reshaped})$,
[0037] $\mathrm{att}_{scores} = \mathrm{softmax}(\mathrm{att}_{activation})$,
[0038] $f_{agg} = \sum (f_{reshaped} \cdot \mathrm{att}_{scores})$,
[0039] where $\mathrm{att}_{activation}$ is the activation value obtained by applying a fully connected operation to the reshaped feature $f_{reshaped}$; $\mathrm{att}_{scores}$ is the attention weights obtained by applying the softmax activation function to $\mathrm{att}_{activation}$; and $f_{agg}$ is the aggregated feature obtained by weighted averaging of $f_{reshaped}$ according to the attention weights $\mathrm{att}_{scores}$;
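The attention pooling formulas above can be illustrated with a short NumPy sketch. It assumes that the fully connected layer is a bias-free weight matrix `w` and that the softmax is taken over the neighbor axis; neither detail is stated in the text, and the function names and random test data are illustrative only.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def attention_pooling(f_reshaped, w):
    """f_reshaped: (N, K, d) neighbor features; w: (d, d) dense weights (assumed bias-free)."""
    att_activation = f_reshaped @ w                    # (N, K, d) fully connected layer
    att_scores = softmax(att_activation, axis=1)       # normalize over the K neighbors
    f_agg = np.sum(f_reshaped * att_scores, axis=1)    # (N, d) weighted sum over neighbors
    return f_agg, att_scores

# Assumed toy input: 4 points, 3 neighbors each, 8 feature channels
rng = np.random.default_rng(0)
f_reshaped = rng.normal(size=(4, 3, 8))
# With all-zero weights every neighbor gets the same score, so pooling reduces to a mean
f_agg, att_scores = attention_pooling(f_reshaped, np.zeros((8, 8)))
```

Because the scores sum to one over the neighbors, the weighted sum behaves as a weighted average, and training the dense weights lets the network emphasize the most informative neighbors instead of treating them equally.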
[0040] The extended residual block, combined with the local feature aggregation module and the attention pooling module, enhances and propagates features. Its calculation formula includes:
[0041] $f_{pc1} = \mathrm{conv2d}(\mathrm{feature})$,
[0042] $f_{pc2} = \mathrm{building\_block}(f_{pc1})$,
[0043] $\mathrm{shortcut} = \mathrm{conv2d}(\mathrm{feature})$,
[0044] $\mathrm{output} = \mathrm{leaky\_relu}(f_{pc2} + \mathrm{shortcut})$,
[0045] where $f_{pc1}$ is the intermediate feature obtained by performing a convolution operation on the input features; $f_{pc2}$ is the feature obtained by processing $f_{pc1}$ with the local feature aggregation building block; $\mathrm{shortcut}$ is the shortcut connection feature obtained by performing a convolution operation on the input features; and $\mathrm{output}$ is the final output feature obtained by adding $f_{pc2}$ and $\mathrm{shortcut}$ and passing the sum through the leaky ReLU activation function.
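The data flow of the residual block can be sketched as below. This is a hedged illustration rather than the patented network: the 1x1 convolution is modeled as a per-point linear map, the leaky ReLU slope of 0.2 is an assumed value, and `building_block` is passed in as a placeholder callable standing in for the local feature aggregation described earlier.

```python
import numpy as np

def shared_mlp(feature, w):
    """A 1x1 conv2d over a point set: the same linear map applied to every point."""
    return feature @ w

def leaky_relu(x, slope=0.2):
    """Leaky ReLU; the slope value is an assumption, not taken from the source."""
    return np.where(x > 0, x, slope * x)

def residual_block(feature, w_in, w_short, building_block):
    f_pc1 = shared_mlp(feature, w_in)         # intermediate features on the main path
    f_pc2 = building_block(f_pc1)             # local feature aggregation (placeholder)
    shortcut = shared_mlp(feature, w_short)   # shortcut connection features
    return leaky_relu(f_pc2 + shortcut)       # residual sum followed by the activation

# Toy example with a stand-in building block that doubles the channel count
feature = np.ones((5, 4))
w_in = np.eye(4)
w_short = -2.0 * np.ones((4, 8))
output = residual_block(feature, w_in, w_short,
                        building_block=lambda f: np.concatenate([f, f], axis=-1))
```

The only structural requirement this sketch enforces is that the main path and the shortcut path produce the same shape, since they are added element-wise before the activation.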
[0046] Optionally, the step of performing end-to-end semantic segmentation on the 3D point cloud data corresponding to the citrus trees based on a semantic segmentation model trained to convergence, in order to extract the 3D point cloud data and 3D spatial coordinates corresponding to the ripe citrus fruits, includes:
[0047] The improved RandLA-Net network is trained using a pre-defined training set, where each point in the training set has been labeled as a fruit or a non-fruit part. During training, the improved RandLA-Net network learns to identify and classify different fruit tree components from the local structural features of the point cloud through its random sampling and local feature aggregation mechanism.
[0048] After training, the preprocessed fruit tree point cloud data is input into the improved RandLA-Net network for semantic segmentation. The improved RandLA-Net network outputs the semantic label of each point, thereby distinguishing between mature citrus fruits, immature citrus fruits, and citrus tree parts.
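Once every point carries a semantic label, the points of mature fruits and their spatial coordinates can be extracted by filtering on the label. The sketch below is one possible way to do this and is not described in the source: the label ids, the proximity-based grouping of points into individual fruits, and the `radius` value are all assumptions made for illustration.

```python
import numpy as np
from collections import deque

# Hypothetical label ids; the source does not specify the encoding
MATURE_FRUIT, IMMATURE_FRUIT, TREE = 0, 1, 2

def fruit_centroids(points, labels, mature_label=MATURE_FRUIT, radius=0.05):
    """Group mature-fruit points that lie within `radius` of each other
    and return one centroid (x, y, z) per group."""
    fruit = points[labels == mature_label]
    unvisited = set(range(len(fruit)))
    centroids = []
    while unvisited:
        seed = unvisited.pop()
        queue, members = deque([seed]), [seed]
        # Breadth-first growth of the group from the seed point
        while queue and unvisited:
            i = queue.popleft()
            cand = np.fromiter(unvisited, dtype=int)
            dist = np.linalg.norm(fruit[cand] - fruit[i], axis=1)
            for j in cand[dist <= radius]:
                unvisited.discard(int(j))
                queue.append(int(j))
                members.append(int(j))
        centroids.append(fruit[members].mean(axis=0))
    return np.array(centroids).reshape(-1, 3)

# Assumed toy cloud: two mature fruits, one tree point, one immature-fruit point
points = np.array([[0.00, 0.00, 0.0], [0.02, 0.00, 0.0],
                   [1.00, 1.00, 1.0], [1.02, 1.00, 1.0],
                   [0.50, 0.50, 0.5], [0.01, 0.01, 0.0]])
labels = np.array([MATURE_FRUIT, MATURE_FRUIT, MATURE_FRUIT, MATURE_FRUIT,
                   TREE, IMMATURE_FRUIT])
centroids = fruit_centroids(points, labels)
```

Each returned centroid is a candidate three-dimensional coordinate for one mature fruit; the immature-fruit and tree points are ignored even when they lie close to a mature fruit.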
[0049] Optionally, after the step of the harvesting robot determining the harvesting position of the ripe citrus fruit based on the corresponding three-dimensional point cloud data and three-dimensional spatial coordinates, and driving its robotic arm to harvest the ripe citrus fruit according to the harvesting position, the process includes:
[0050] In response to citrus tree pruning instructions, determine the corresponding 3D point cloud data of the citrus trees;
[0051] Based on the three-dimensional point cloud data corresponding to the citrus trees, the corresponding positions of citrus branches and trunks are determined. Based on the corresponding positions of the citrus branches, the citrus trees are pruned to complete the harvesting and positioning of the mature citrus fruits.
[0052] A citrus fruit picking positioning device provided for another purpose of this application includes:
[0053] The camera pose determination module is configured to respond to the citrus fruit picking and positioning command, acquire citrus tree images from various angles, and call the preset COLMAP algorithm to extract and match feature points from the citrus tree images from various angles to generate camera pose data corresponding to the citrus tree images.
[0054] The fruit tree point cloud determination module is configured to use a neural radiance field model trained to convergence to learn the implicit three-dimensional representation of citrus fruit trees from the citrus fruit tree images from various angles based on the camera pose data, so as to generate the corresponding three-dimensional point cloud data of the citrus fruit trees.
[0055] The semantic segmentation module is configured to perform end-to-end semantic segmentation on the 3D point cloud data corresponding to the citrus trees based on a semantic segmentation model that has been trained to convergence, so as to extract the 3D point cloud data and 3D spatial coordinates corresponding to the mature citrus fruits. The basic network architecture of the semantic segmentation model is an improved RandLA-Net network, which consists of an input layer, an output layer, 5 encoder layers, 5 decoder layers, a bilateral enhancement module, and a bottleneck layer. The bilateral enhancement module is located between the encoder layer and the bottleneck layer.
[0056] The picking and positioning module is configured to allow the picking robot to determine the picking location of the ripe citrus fruit based on the corresponding three-dimensional point cloud data and three-dimensional spatial coordinates, and drive its robotic arm to pick the ripe citrus fruit according to the picking location, thereby completing the picking and positioning of the ripe citrus fruit.
[0057] An electronic device provided for another purpose of this application includes a central processing unit and a memory, the central processing unit being used to invoke and run a computer program stored in the memory to perform the steps of the citrus fruit harvesting and positioning method of this application.
[0058] A computer-readable storage medium is provided for another purpose of this application, which stores, in the form of computer-readable instructions, a computer program implemented according to the citrus fruit picking and positioning method, which, when invoked by a computer, executes the steps included in the corresponding method.
[0059] Compared to existing technologies, this application addresses the problems of existing image-based modeling methods being easily affected by factors such as lighting changes and occlusion, making it difficult to accurately recover the complex structure of fruit trees, and point cloud-based modeling methods, where noise and sparsity of point clouds may affect the accuracy of modeling. This application provides, but is not limited to, the following beneficial effects:
[0060] Firstly, the output of the Neural Radiance Field (NeRF) model is used to construct detailed point cloud maps of fruit trees. These point cloud maps not only capture the geometric structure and color information of the fruit trees and fruits, but also reflect the complex texture and branch distribution of the tree, providing accurate 3D data for subsequent agricultural applications such as fruit maturity assessment, automated fruit harvesting, automated pruning, and health monitoring. By combining the COLMAP algorithm and the NeRF model, this application can achieve accurate 3D reconstruction of fruit trees, and the effectiveness of this method has been verified by experimental results. Furthermore, this method provides a new technical approach for acquiring 3D structural data of fruit trees in a non-invasive manner, and has the potential to be extended to other plants or complex scenarios.
[0061] Secondly, this application proposes an improved RandLA-Net network and applies it to the fruit semantic segmentation task. By introducing a bilateral enhancement module, accurate semantic segmentation of ripe fruit, immature fruit, and fruit tree branches and leaves is achieved. The improved RandLA-Net network exhibits higher accuracy and robustness when processing fruit data.
[0062] Third, this application introduces a bilateral enhancement module (BEM) on the basis of the original RandLA-Net network, which effectively improves the network's performance in fruit semantic segmentation tasks. The BEM enhances feature processing through the Local Spatial Encoding (LocSE) module and the Attentive Pooling module.
[0063] Fourth, the bilateral enhancement module combines spatial proximity weights with feature similarity weights to generate a comprehensive weight for feature enhancement, and uses this weight to adjust the feature representation of each point. This fusion highlights the features of small targets (i.e., fruits) while suppressing non-target features and background noise, and it enhances local details in the fruit region. Experiments applying the bilateral enhancement module show that it helps maintain the clarity of fruit edges, which is crucial for accurate segmentation and for subsequent agricultural operations such as automated harvesting. By strengthening the distinction between fruits and other plant parts, the bilateral enhancement module enables the RandLA-Net network to learn and predict the location and extent of mature fruits more effectively, thereby optimizing the overall segmentation performance.
[0064] Furthermore, the citrus fruit harvesting and positioning method of this application, through precise and rapid positioning of mature fruits, lays a solid theoretical foundation for the automated harvesting of mature fruits, greatly saving manpower and material resources.
Attached Figure Description
[0065] The above and / or additional aspects and advantages of this application will become apparent and readily understood from the following description of the embodiments taken in conjunction with the accompanying drawings, wherein:
[0066] Figure 1 is a flowchart illustrating the method for citrus fruit harvesting and positioning in an embodiment of this application;
[0067] Figure 2 is a schematic diagram of the RGB image of a citrus tree and the point cloud image of the tree after three-dimensional reconstruction in an embodiment of this application;
[0068] Figure 3 is a schematic diagram of the improved RandLA-Net network structure with the added bilateral enhancement module in an embodiment of this application;
[0069] Figure 4 is a schematic diagram illustrating the change in training accuracy during model training in an embodiment of this application;
[0070] Figure 5 is a schematic diagram illustrating the change in learning rate during model training in an embodiment of this application;
[0071] Figure 6 is a schematic diagram illustrating the change in the loss value during model training in an embodiment of this application;
[0072] Figure 7 is a schematic diagram of the original input image in an embodiment of this application;
[0073] Figure 8 is a schematic diagram of the label for each point in the image in an embodiment of this application;
[0074] Figure 9 is a schematic diagram of the segmentation effect of the original RandLA-Net network in an embodiment of this application;
[0075] Figure 10 is a schematic diagram illustrating the segmentation effect of the improved RandLA-Net network in an embodiment of this application;
[0076] Figure 11 is a schematic diagram of the citrus fruit picking and positioning device in an embodiment of this application;
[0077] Figure 12 is a schematic diagram of the structure of the computer device in an embodiment of this application.
Detailed Implementation
[0078] The embodiments of this application are described in detail below. Examples of these embodiments are shown in the accompanying drawings, wherein the same or similar reference numerals denote the same or similar elements or elements having the same or similar functions throughout. The embodiments described below with reference to the accompanying drawings are exemplary and are only used to explain this application, and should not be construed as limiting this application.
[0079] Those skilled in the art will understand that, unless specifically stated otherwise, the singular forms “a,” “an,” and “the” used herein may also include the plural forms. It should be further understood that the term “comprising” as used in this application means the presence of the stated features, integers, steps, operations, elements, and / or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and / or groups thereof. It should be understood that when an element is said to be “connected” or “coupled” to another element, it can be directly connected or coupled to the other element, or there may be intermediate elements. Furthermore, “connected” or “coupled” as used herein can include wireless connections or wireless coupling. The term “and / or” as used herein includes all or any units and all combinations of one or more associated listed items.
[0080] It will be understood by those skilled in the art that, unless otherwise defined, all terms used herein (including technical and scientific terms) have the same meaning as commonly understood by one of ordinary skill in the art to which this application pertains. It should also be understood that terms such as those defined in general dictionaries should be understood to have the same meaning as in the context of the prior art, and should not be interpreted in an idealized or overly formal sense unless specifically defined as herein.
[0081] Those skilled in the art will understand that the terms "client," "terminal," and "terminal device" as used herein include both devices that have only a wireless signal receiver without transmission capability and devices that have receiving and transmitting hardware capable of bidirectional communication over a bidirectional communication link. Such devices may include: cellular or other communication devices, such as personal computers or tablets, with a single-line display, a multi-line display, or no multi-line display; PCS (Personal Communications Service) devices that can combine voice, data processing, fax, and / or data communication capabilities; PDAs (Personal Digital Assistants) that may include a radio frequency receiver, pager, internet / intranet access, web browser, notepad, calendar, and / or GPS (Global Positioning System) receiver; and conventional laptops and / or handheld computers or other devices that have and / or include radio frequency receivers. As used herein, a "client," "terminal," or "terminal device" can be portable, transportable, installed in a means of transportation (air, sea, and / or land), or suitable and / or configured to operate locally and / or in a distributed manner at any other location on Earth and / or in space. A "client," "terminal," or "terminal device" as used herein can also be a communication terminal, an internet access terminal, or a music / video playback terminal, such as a PDA, a MID (Mobile Internet Device), and / or a mobile phone with music / video playback capabilities, or a smart TV, set-top box, etc.
[0082] The hardware referred to by the names "server," "client," and "service node" in this application is essentially an electronic device with the equivalent capabilities of a personal computer. It is a hardware device with the necessary components revealed by the von Neumann architecture, such as a central processing unit (including an arithmetic logic unit and a control unit), memory, input devices, and output devices. The computer program is stored in its memory, and the central processing unit loads the program stored in the secondary storage into the main memory to run it, execute the instructions in the program, and interact with the input and output devices to complete specific functions.
[0083] It should be noted that the concept of "server" used in this application can also be extended to the case of server clusters. Based on the network deployment principles understood by those skilled in the art, the servers should be logically divided. Physically, these servers can be independent of each other but accessible through interfaces, or they can be integrated into a single physical computer or a computer cluster. Those skilled in the art should understand this flexibility and should not use it to constrain the implementation of the network deployment method in this application.
[0084] One or more of the technical features of this application, unless explicitly specified herein, can be deployed on a server and accessed by a client remotely calling the online service interface provided by the server, or can be directly deployed and run on a client for access.
[0085] Unless otherwise specified, the neural network models referenced or potentially referenced in this application may be deployed on a remote server and invoked remotely on the client, or deployed on a client with the capability to invoke directly. In some embodiments, when running on the client, the corresponding intelligence may be acquired through transfer learning in order to reduce the requirements on the client's hardware resources and avoid excessive consumption of the client's hardware resources.
[0086] Unless otherwise specified, all data involved in this application may be stored remotely on a server or on a local terminal device, as long as it is suitable for use by the technical solution of this application.
[0087] Those skilled in the art will understand that although the various methods in this application are described based on the same concept and thus present commonality among them, they can be performed independently unless otherwise specified. Similarly, the various embodiments disclosed in this application are all based on the same inventive concept; therefore, concepts expressed in the same way, as well as concepts that are appropriately changed for convenience but are expressed differently, should be understood equivalently.
[0088] Unless otherwise expressly stated, the various embodiments disclosed in this application can be combined in a cross-cutting manner to flexibly construct new embodiments, as long as such combination does not depart from the inventive spirit of this application and can meet the needs of the prior art or solve a certain deficiency in the prior art. Those skilled in the art should be aware of such modifications.
[0089] Referring to Figure 1, in one embodiment of the citrus fruit harvesting and positioning method of this application, the method includes:
[0090] Step S10: In response to the citrus fruit picking and positioning command, acquire citrus tree images from various angles, and call the preset COLMAP algorithm to extract and match feature points from the citrus tree images from various angles to generate camera pose data corresponding to the citrus tree images.
[0091] The harvesting robot can respond to the citrus fruit harvesting and positioning command to acquire images of citrus trees from various angles. These images can be obtained using the binocular camera in the harvesting robot. After acquiring the images from various angles, the harvesting robot calls a preset COLMAP algorithm to extract and match feature points from the images to generate camera pose data corresponding to the citrus tree images.
[0092] Specifically, the COLMAP algorithm is used to generate camera pose data corresponding to the citrus tree images. This camera pose data includes the camera's position and orientation information. COLMAP is an efficient image-based 3D reconstruction algorithm widely used in computer vision and photogrammetry. It generates a dense 3D point cloud model by extracting and matching feature points from images taken from multiple perspectives. COLMAP supports automated feature extraction, matching, incremental structure from motion (SfM), and dense reconstruction, enabling the generation of high-precision 3D models without relying on specific hardware. Therefore, this application uses the COLMAP algorithm to calculate the camera pose data of the citrus tree images, while the reconstruction itself is performed by a Neural Radiance Field (NeRF) model.
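As a concrete illustration of what the camera pose data contains: in COLMAP's text export, each image pose is stored as a quaternion (QW, QX, QY, QZ) and a translation (TX, TY, TZ) that map world coordinates to camera coordinates, so the camera position in the world is $C = -R^{T} t$. The sketch below converts one such pose into a position and orientation; the function names and the sample values are illustrative assumptions, not part of the source.

```python
import numpy as np

def quat_to_rotmat(qw, qx, qy, qz):
    """Rotation matrix of a (possibly unnormalized) quaternion in (w, x, y, z) order."""
    q = np.array([qw, qx, qy, qz], dtype=float)
    qw, qx, qy, qz = q / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (qy * qy + qz * qz), 2 * (qx * qy - qz * qw), 2 * (qx * qz + qy * qw)],
        [2 * (qx * qy + qz * qw), 1 - 2 * (qx * qx + qz * qz), 2 * (qy * qz - qx * qw)],
        [2 * (qx * qz - qy * qw), 2 * (qy * qz + qx * qw), 1 - 2 * (qx * qx + qy * qy)],
    ])

def camera_center(quat, tvec):
    """Camera position in world coordinates for a world-to-camera pose (R, t)."""
    rot = quat_to_rotmat(*quat)
    return -rot.T @ np.asarray(tvec, dtype=float)

# Assumed sample poses: identity rotation, and a 90-degree rotation about the z axis
center_a = camera_center((1.0, 0.0, 0.0, 0.0), (1.0, 2.0, 3.0))
s = np.sqrt(0.5)
center_b = camera_center((s, 0.0, 0.0, s), (1.0, 0.0, 0.0))
```

NeRF pipelines usually need the inverse (camera-to-world) transform, which is why this conversion step typically sits between pose estimation and radiance field training.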
[0093] Step S20: Using a neural radiation field model that has been trained to convergence, learn the implicit three-dimensional representation of citrus trees from the images of citrus trees from various angles based on the camera pose data, so as to generate three-dimensional point cloud data corresponding to the citrus trees.
[0094] After generating the camera pose data corresponding to the citrus tree images, a neural radiation field model (NeRF) trained to convergence is used to learn the implicit 3D representation of the citrus tree from the citrus tree images at various angles based on the camera pose data, so as to generate the 3D point cloud data corresponding to the citrus tree.
[0095] The steps of learning an implicit 3D representation of a citrus tree from images of citrus trees at various angles based on the camera pose data using a neural radiation field model trained to convergence, in order to generate corresponding 3D point cloud data for the citrus tree, include:
[0096] The images of citrus trees from various angles, along with their corresponding camera pose data, are input into a neural radiation field model that has been trained to convergence. The scene is then densely sampled, and a multilayer perceptron is used to predict the volume density and view-dependent emissivity at each location to generate high-quality images.
[0097] Specifically, after obtaining the camera pose data corresponding to the citrus tree image, this camera pose data and the RGB image corresponding to the citrus tree image are input into the Neural Radiation Field Model (NeRF). The Neural Radiation Field Model (NeRF) achieves the ability to generate high-quality images from new perspectives by densely sampling the scene and using a multilayer perceptron (MLP) to predict the volume density and view-dependent emissivity at each location.
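How the per-location predictions become a pixel color can be sketched as follows. This is a minimal NumPy illustration of the standard NeRF volume-rendering quadrature for a single ray, assuming the MLP has already produced a density and a color for each sample. The function name and the sample values are hypothetical.

```python
import numpy as np

def composite_ray(sigmas, colors, deltas):
    """Alpha-composite the samples along one ray.

    sigmas: (S,) volume densities, colors: (S, 3) RGB values,
    deltas: (S,) distances between adjacent samples.
    """
    alphas = 1.0 - np.exp(-sigmas * deltas)            # opacity of each segment
    # Transmittance: fraction of light that reaches sample i unabsorbed.
    trans = np.concatenate([[1.0], np.cumprod(1.0 - alphas)[:-1]])
    weights = trans * alphas                           # contribution of each sample
    rgb = (weights[:, None] * colors).sum(axis=0)
    return rgb, weights

# Hypothetical ray: empty space, then a dense orange surface, then green behind it.
sigmas = np.array([0.0, 50.0, 50.0])
colors = np.array([[0.0, 0.0, 0.0], [1.0, 0.5, 0.0], [0.0, 1.0, 0.0]])
deltas = np.full(3, 0.1)
rgb, weights = composite_ray(sigmas, colors, deltas)
```

In this example almost all of the weight falls on the first dense sample, so the occluded green sample contributes very little to the rendered color.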
[0098] Please see Figure 2. In the final stage of the Neural Radiation Field Model (NeRF) processing, the NeRF output is used to construct detailed point cloud maps of fruit trees. These point cloud maps not only capture the geometric structure and color information of the fruit trees and fruits, but also reflect the complex texture of the tree and the distribution of branches. This provides accurate three-dimensional data for subsequent agricultural applications such as fruit maturity judgment, automated fruit harvesting, automated pruning of fruit trees, and health monitoring.
[0099] By combining the colmap algorithm and the Neural Radiation Field (NeRF) model, this application enables accurate 3D reconstruction of fruit trees, and the effectiveness of this method has been verified through experimental results. Furthermore, this method provides a novel technical approach for acquiring 3D structural data of fruit trees in a non-invasive manner, and has the potential to be extended to other plants or complex scenarios.
[0100] Step S30: Based on the semantic segmentation model trained to convergence, perform end-to-end semantic segmentation on the three-dimensional point cloud data corresponding to the citrus tree to extract the three-dimensional point cloud data and three-dimensional spatial coordinates corresponding to the mature citrus fruit. The basic network architecture of the semantic segmentation model is an improved RandLA-Net network. The improved RandLA-Net network consists of an input layer, an output layer, 5 encoder layers, 5 decoder layers, a bilateral enhancement module, and a bottleneck layer. The bilateral enhancement module is located between the encoder layer and the bottleneck layer.
[0101] After generating the 3D point cloud data corresponding to the citrus trees, end-to-end semantic segmentation is performed on the 3D point cloud data corresponding to the citrus trees based on the semantic segmentation model that has been trained to convergence, so as to extract the 3D point cloud data and 3D spatial coordinates corresponding to the mature citrus fruits. The basic network architecture of the semantic segmentation model is an improved RandLA-Net network. The improved RandLA-Net network consists of an input layer, an output layer, 5 encoder layers, 5 decoder layers, a bilateral enhancement module, and a bottleneck layer. The bilateral enhancement module is located between the encoder layer and the bottleneck layer.
[0102] Furthermore, the step of performing end-to-end semantic segmentation on the 3D point cloud data corresponding to the citrus trees based on the semantic segmentation model trained to convergence, in order to extract the 3D point cloud data and 3D spatial coordinates corresponding to the ripe citrus fruits, includes:
[0103] In the encoder, each layer contains a dilated residual block, which processes the input features and downsamples them by means of a shared multilayer perceptron, a local spatial encoding module, and an attention pooling module.
[0104] Furthermore, the step of performing end-to-end semantic segmentation on the 3D point cloud data corresponding to the citrus trees based on the semantic segmentation model trained to convergence, in order to extract the 3D point cloud data and 3D spatial coordinates corresponding to the ripe citrus fruits, includes:
[0105] Step S301: Obtain the input features $(N, d_{in})$ of the dilated residual block, where $N$ is the number of points and $d_{in}$ is the feature dimension of each point. The input point features are transformed to dimension $(N, d_{in}/2)$ by a shared multilayer perceptron;
[0106] Step S303, local spatial encoding: the features $(N, d_{in}/2)$ processed by the shared multilayer perceptron are combined with the three-dimensional coordinates $(N, 3)$ of each point to generate features of shape $(N, d_{in})$;
[0107] Step S305, attention pooling: for the features $(N, d_{in})$ output by the local spatial encoding module, the attention pooling module first calculates the attention scores and then weights the features by the calculated scores to generate new features $(N, d_{in})$;
[0108] Step S307, shortcut path: the input features $(N, d_{in})$ are transformed into $(N, d_{in}/2)$ by a shared multilayer perceptron to serve as the shortcut path;
[0109] Step S309, residual connection and activation function: the features of the main path and the shortcut path are added, and the final features $(N, 2d_{in})$ are output through the Leaky ReLU activation function.
[0110] In this application, the improved RandLA-Net network (random local point cloud feature aggregation network) is applied to the point cloud data of citrus fruit trees generated by the Neural Radiation Field Model (NeRF) to achieve accurate semantic segmentation of mature fruits, immature fruits and other parts of the citrus tree.
[0111] Compared to the original RandLA-Net network, the main improvement of this application is the addition of a Bilateral Enhancement Module (BEM) between the original Encoder Layer and Bottleneck Layer. The main structure of the improved RandLA-Net consists of an input layer, an output layer, five encoder layers and five decoder layers, a BEM layer and a bottleneck layer.
[0112] Please see Figure 3, which is the architecture diagram of the improved RandLA-Net network with the added bilateral enhancement module.
[0113] Among them, the dilated residual block belongs to the encoder layer in the network structure. In the encoder part, each layer uses dilated residual blocks to process input features. These residual blocks are responsible for local feature aggregation and feature pyramid construction in the encoder layer.
[0114] Specifically, the encoder structure is as follows: Encoder Layer 1 to Encoder Layer 5, each containing a dilated residual block. The dilated residual block processes the input features and downsamples them through a shared multilayer perceptron (Shared MLP), a local spatial encoding (LoCSE) module, and an attentive pooling module.
[0115] Furthermore, the processing steps of the dilated residual block include:
[0116] Obtain the input features $(N, d_{in})$, where $N$ is the number of points and $d_{in}$ is the feature dimension of each point; the input features are first transformed to dimension $(N, d_{in}/2)$ by a shared multilayer perceptron (MLP);
[0117] Local spatial encoding: the features $(N, d_{in}/2)$ processed by the shared MLP are combined with the three-dimensional coordinates $(N, 3)$ of each point to generate features of shape $(N, d_{in})$. These features are obtained by calculating the relative position encoding of each point with respect to its nearest neighbors and concatenating it with the input features;
[0118] Attention pooling: for the features $(N, d_{in})$ output by the local spatial encoding module, the attention pooling mechanism first calculates the attention scores; the calculated scores are then used to weight the features, generating new features $(N, d_{in})$;
[0119] Shortcut path (skip connection): the initial input features $(N, d_{in})$ are converted directly to $(N, 2d_{in})$ by a shared multilayer perceptron (MLP). This part implements the residual connection, allowing input features to jump directly to the output while preserving the original feature information;
[0120] Residual connection and activation function: the features of the main path and the shortcut path are added together, and the final features $(N, 2d_{in})$ are output through the LeakyReLU activation function. The $(N, 2d_{in})$ output of this step combines the high-level features obtained through feature extraction and aggregation on the main path with the input feature information preserved on the shortcut path. This residual connection helps to alleviate the vanishing-gradient problem in deep networks and improves the network's performance.
[0121] In some embodiments, the shortcut path directly passes the input data to later layers of the network without undergoing complex intermediate transformations. This design allows the network to learn the residuals of the input (i.e., the difference between the original input and the desired output), thereby improving the training performance and stability of the model.
[0122] Furthermore, segmentation of citrus fruit point cloud data requires highly accurate differentiation between the fruit and its surrounding environment, which is particularly challenging in high-density vegetation. Fruits are typically small and spatially connected to other plant components, making it difficult for ordinary point cloud processing networks to segment them accurately. Bilateral augmentation modules, by considering the spatial and feature similarity between points and their neighbors, can effectively enhance the feature representation of citrus fruits. Specifically, the bilateral augmentation module utilizes spatial proximity and feature similarity weights to maintain local details in the fruit region while suppressing noise and non-target features, thereby improving segmentation accuracy and robustness.
[0123] The bilateral enhancement module (BEM) achieves feature enhancement through the following steps:
[0124] In the bilateral enhancement module, the relative position encoding module calculates the relative position and distance between a point and its neighboring points, and uses the relative position and distance between the point and its neighboring points as features. The calculation formula includes:
[0125] $\text{relative}_{xyz} = \text{xyz}_{tile} - \text{neighbor}_{xyz}$,
[0126] $\text{relative}_{dis} = \lVert \text{relative}_{xyz} \rVert_{2}$,
[0127] $\text{relative}_{feature} = \text{concat}(\text{relative}_{dis}, \text{relative}_{xyz}, \text{xyz}_{tile}, \text{neighbor}_{xyz})$,
[0128] where $\text{relative}_{xyz}$ represents the relative position of each point with respect to its neighboring points; $\text{neighbor}_{xyz}$ represents the coordinates of the neighboring points; $\text{xyz}_{tile}$ is obtained by replicating $\text{xyz}$ into a matrix of the same shape as $\text{neighbor}_{xyz}$, where $\text{xyz}$ represents the three-dimensional coordinates of a point; $\text{relative}_{dis}$ is the Euclidean distance between each point and its neighbors; and $\text{relative}_{feature}$ is the concatenation of the relative distance, the relative position, the original coordinates, and the coordinates of the neighboring points;
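The relative position encoding can be sketched in NumPy as follows. This is a minimal illustration that finds the neighbors by brute-force k-nearest-neighbor search, which is an assumption made for brevity. The function name, the neighbor count, and the random point cloud are hypothetical.

```python
import numpy as np

def relative_position_encoding(xyz, k):
    """Return (N, k, 10) features: [relative_dis, relative_xyz, xyz_tile, neighbor_xyz]."""
    # Brute-force k-nearest neighbours; each point is its own nearest neighbour.
    dist2 = ((xyz[:, None, :] - xyz[None, :, :]) ** 2).sum(axis=-1)      # (N, N)
    neigh_idx = np.argsort(dist2, axis=1)[:, :k]                         # (N, k)
    neighbor_xyz = xyz[neigh_idx]                                        # (N, k, 3)
    xyz_tile = np.repeat(xyz[:, None, :], k, axis=1)                     # (N, k, 3)
    relative_xyz = xyz_tile - neighbor_xyz                               # (N, k, 3)
    relative_dis = np.linalg.norm(relative_xyz, axis=-1, keepdims=True)  # (N, k, 1)
    return np.concatenate(
        [relative_dis, relative_xyz, xyz_tile, neighbor_xyz], axis=-1
    )

rng = np.random.default_rng(0)
points = rng.random((100, 3))   # hypothetical point cloud
relative_feature = relative_position_encoding(points, k=16)
```

Each point thus carries ten values per neighbor: one distance and three coordinate triples. A real pipeline would typically replace the quadratic-cost neighbor search with a KD-tree query.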
[0129] In the local feature aggregation module of the bilateral enhancement module, feature fusion and enhancement are achieved by combining relative position features and neighborhood point features. The calculation formula includes:
[0130] $f_{concat1} = \text{concat}(f_{neighbours}, f_{xyz1})$,
[0131] $f_{pcagg1} = \text{attention\_pooling}(f_{concat1})$,
[0132] $f_{concat2} = \text{concat}(f_{neighbours}, f_{xyz2})$,
[0133] $f_{pcagg2} = \text{attention\_pooling}(f_{concat2})$,
[0134] where $f_{concat1}$ is the concatenation of the neighborhood point features $f_{neighbours}$ and the relative position features $f_{xyz1}$, and $f_{pcagg1}$ is the feature obtained by weighted aggregation of $f_{concat1}$ through attention pooling;
[0135] In the attention pooling module of the bilateral enhancement module, the attention weight of each point to its neighboring points is calculated, and the features of the neighboring points are weighted and averaged to further enhance the feature representation. The calculation formula includes:
[0136] $att_{activation} = \text{dense}(f_{reshaped})$,
[0137] $att_{scores} = \text{softmax}(att_{activation})$,
[0138] $f_{agg} = \sum (f_{reshaped} \cdot att_{scores})$,
[0139] where $att_{activation}$ is the activation value obtained by applying a fully connected operation to the reshaped feature $f_{reshaped}$; $att_{scores}$ is the attention weight obtained by applying the softmax activation function to the attention activation value $att_{activation}$; and $f_{agg}$ is the aggregated feature obtained by weighted averaging of the reshaped feature $f_{reshaped}$ according to the attention weights $att_{scores}$;
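The three attention pooling formulas can be sketched in NumPy as follows. This is a minimal illustration in which the dense layer is a single bias-free weight matrix and the softmax is taken over the neighbor axis. Both choices, along with the function name and the random inputs, are assumptions made for illustration.

```python
import numpy as np

def attention_pooling(f_reshaped, w):
    """Aggregate neighbour features with attention weights.

    f_reshaped: (N, k, d) neighbour features,
    w: (d, d) dense-layer weights (bias omitted).
    """
    att_activation = f_reshaped @ w                               # (N, k, d)
    # Softmax over the neighbour axis, shifted for numerical stability.
    shifted = att_activation - att_activation.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    att_scores = exp / exp.sum(axis=1, keepdims=True)             # (N, k, d)
    f_agg = (f_reshaped * att_scores).sum(axis=1)                 # (N, d)
    return f_agg, att_scores

rng = np.random.default_rng(0)
f_reshaped = rng.normal(size=(100, 16, 8))   # hypothetical neighbour features
w = rng.normal(size=(8, 8)) * 0.1            # stands in for trained weights
f_agg, att_scores = attention_pooling(f_reshaped, w)
```

Because the scores sum to one over the neighbors of each point, the aggregation is a weighted average. With all-zero weights it reduces to a plain mean over the neighborhood, which is a convenient sanity check.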
[0140] The expanded residual block, combined with the local feature aggregation module and the attention pooling module, enhances and propagates features. Its calculation formula includes:
[0141] $f_{pc1} = \text{conv2d}(feature)$,
[0142] $f_{pc2} = \text{building\_block}(f_{pc1})$,
[0143] $shortcut = \text{conv2d}(feature)$,
[0144] $output = \text{leaky\_relu}(f_{pc2} + shortcut)$,
[0145] where $f_{pc1}$ is the intermediate feature obtained by performing a convolution operation on the input feature; $f_{pc2}$ is the feature obtained by processing $f_{pc1}$ with the local feature aggregation building block ($\text{building\_block}$); $shortcut$ is the shortcut feature obtained by performing a convolution operation on the input feature; and $output$ is the final output feature obtained by adding $f_{pc2}$ and $shortcut$ and passing the sum through the leaky ReLU activation function.
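The data flow of the residual block can be traced with a small NumPy sketch. It follows the shapes given in paragraphs [0116] to [0120], with the shortcut mapped to $(N, 2d_{in})$, and models each convolution as a per-point linear map. Several parts are assumptions rather than details of this application: the additional shared MLP that lifts the main path to $2d_{in}$ so the two paths can be added (as in the reference RandLA-Net implementation), the leaky ReLU slope, the random weights, and the placeholder building block, which only reproduces the channel doubling of the real local feature aggregation.

```python
import numpy as np

def shared_mlp(x, w):
    """Shared MLP (1x1 convolution): the same linear map for every point."""
    return x @ w

def leaky_relu(x, slope=0.2):
    """Leaky ReLU; the slope 0.2 is an assumed value."""
    return np.where(x > 0, x, slope * x)

def dilated_residual_block(feature, building_block, params):
    """Map input features (N, d_in) to output features (N, 2 * d_in)."""
    f_pc1 = shared_mlp(feature, params["w_in"])         # (N, d_in / 2)
    f_pc2 = building_block(f_pc1)                       # (N, d_in)
    # Assumption: lift the main path to 2 * d_in so it matches the shortcut.
    f_pc2 = shared_mlp(f_pc2, params["w_out"])          # (N, 2 * d_in)
    shortcut = shared_mlp(feature, params["w_short"])   # (N, 2 * d_in)
    return leaky_relu(f_pc2 + shortcut)

def placeholder_building_block(f):
    """Hypothetical stand-in for local feature aggregation: doubles the channels."""
    return np.concatenate([f, np.tanh(f)], axis=1)

n_points, d_in = 100, 16
rng = np.random.default_rng(0)
params = {
    "w_in": rng.normal(size=(d_in, d_in // 2)),
    "w_out": rng.normal(size=(d_in, 2 * d_in)),
    "w_short": rng.normal(size=(d_in, 2 * d_in)),
}
feature = rng.normal(size=(n_points, d_in))
output = dilated_residual_block(feature, placeholder_building_block, params)
```

With 16 input channels the block returns 32 channels per point, matching the $(N, 2d_{in})$ output described in the text.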
[0146] As can be seen from the above embodiments, the bilateral enhancement module combines spatial proximity weights and feature similarity weights to generate a comprehensive weight for feature enhancement. This fusion method can effectively highlight the features of small targets (i.e., fruits) while suppressing background noise.
[0147] By applying the comprehensive weight, the bilateral enhancement module enhances local details in the fruit region while maintaining the sharpness of fruit edges, which is crucial for accurate segmentation and for subsequent agricultural operations such as automated harvesting. Specifically, the module uses these weights to adjust the feature representation of each point, thereby enhancing target features and suppressing non-target features and noise. Experiments applying the bilateral enhancement module show that it helps maintain the sharpness of fruit edges. By strengthening the distinction between the fruit and other plant parts, the module enables RandLA-Net to learn and predict the accurate location and extent of the fruit more effectively, thus optimizing overall segmentation performance.
[0148] The steps of performing end-to-end semantic segmentation on the 3D point cloud data corresponding to the citrus trees based on a semantic segmentation model trained to convergence, in order to extract the 3D point cloud data and 3D spatial coordinates corresponding to the mature citrus fruits, include:
[0149] Step S3001: Train the improved RandLA-Net network using a preset training set, wherein each point in the training set has been labeled as a fruit or a non-fruit part. During the training process, the improved RandLA-Net network learns to identify and classify different fruit tree components from the local structural features of the point cloud through its random sampling and local feature aggregation mechanism.
[0150] Step S3003: After training is completed, the preprocessed fruit tree point cloud data is input into the improved RandLA-Net network for semantic segmentation. The improved RandLA-Net network outputs the semantic label of each point, thereby distinguishing between mature citrus fruits, immature citrus fruits, and citrus tree parts.
[0151] Specifically, necessary preprocessing steps are performed on the citrus tree point cloud data generated by the Neural Radiation Field Model (NeRF), including filtering and normalization, to standardize the citrus tree point cloud data and eliminate noise and outliers. These steps ensure the quality of the input data and lay a solid foundation for subsequent efficient segmentation.
[0152] The specific steps for preprocessing point cloud data of citrus trees are as follows:
[0153] Converting citrus tree point cloud data into PLY files: The original point cloud data file is converted into a PLY format file, with each line containing information such as location, color, and label. All points of each instance are aggregated to generate point cloud data containing fruit and background labels.
[0154] Grid sampling and KDTree generation: The point cloud data of citrus trees was downsampled using a grid sampling method. Points within each 0.04m cube were averaged, and the number of categories within each cube was counted. The category with the highest percentage was selected as the sampled category. The sampled point cloud data was saved as a PLY file, and a KDTree was generated for efficient spatial querying. Projection information was saved for verification purposes.
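The grid sampling step can be sketched in NumPy as follows. This is a minimal illustration that averages the points falling in each 0.04 m cube and assigns the cube the most frequent label among them. The function name and the toy data are hypothetical, and PLY export and KD-tree construction are omitted.

```python
import numpy as np

def grid_subsample(points, labels, grid_size=0.04):
    """Average the points in each cube and keep the most frequent label per cube."""
    voxel = np.floor(points / grid_size).astype(np.int64)     # integer cube index
    _, inverse = np.unique(voxel, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)                             # one cube id per point
    n_voxels = inverse.max() + 1
    counts = np.bincount(inverse, minlength=n_voxels)
    sub_points = np.zeros((n_voxels, 3))
    np.add.at(sub_points, inverse, points)                    # sum of points per cube
    sub_points /= counts[:, None]                             # mean of points per cube
    votes = np.zeros((n_voxels, labels.max() + 1), dtype=np.int64)
    np.add.at(votes, (inverse, labels), 1)                    # label histogram per cube
    return sub_points, votes.argmax(axis=1)

# Toy data: three points share one cube (labels 1, 1, 0); one point lies in another.
points = np.array([
    [0.01, 0.01, 0.01],
    [0.03, 0.03, 0.03],
    [0.02, 0.02, 0.02],
    [0.10, 0.10, 0.10],
])
labels = np.array([1, 1, 0, 2])
sub_points, sub_labels = grid_subsample(points, labels)
```

The four input points collapse to two, and the first cube receives label 1 by majority vote. A KD-tree can then be built over the subsampled points for the spatial queries mentioned in the text.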
[0155] Furthermore, the improved RandLA-Net network was trained using a detailed labeled training set, where each point was labeled as a ripe citrus fruit, an unripe citrus fruit, or a part of the citrus tree.
[0156] During training, the improved RandLA-Net network learns to identify and classify different fruit tree components from the local structural features of the point cloud through its random sampling and local feature aggregation mechanism.
[0157] After training, the preprocessed fruit tree point cloud data is input into the improved RandLA-Net network for semantic segmentation. The network outputs a semantic label for each point, thereby distinguishing between mature citrus fruits, immature citrus fruits, and citrus tree parts. This process focuses on accurately identifying and segmenting the point cloud data corresponding to mature and immature citrus fruits from the complex tree structure.
[0158] Furthermore, the segmentation results are evaluated to verify the performance of the improved RandLA-Net network in the semantic segmentation task of fruit tree point clouds. Common evaluation metrics, such as recognition accuracy, intersection over union (IOU) of a single target, and mean IOU, are used to measure the accuracy and effectiveness of the model in recognizing and segmenting fruits.
[0159] Through these steps, this application demonstrates the effectiveness and accuracy of the improved RandLA-Net in processing complex fruit tree point cloud data generated by an advanced Neural Radiation Field (NeRF) model, thereby supporting automated fruit tree management and decision-making.
[0160] Step S40: The harvesting robot determines the harvesting position of the mature citrus fruit based on the corresponding three-dimensional point cloud data and three-dimensional spatial coordinates, and drives its robotic arm to harvest the mature citrus fruit according to the harvesting position, so as to complete the harvesting and positioning of the mature citrus fruit.
[0161] After extracting the three-dimensional point cloud data and three-dimensional spatial coordinates corresponding to the ripe citrus fruit, the harvesting robot determines the harvesting position of the ripe citrus fruit based on the three-dimensional point cloud data and three-dimensional spatial coordinates, and drives its robotic arm to harvest the ripe citrus fruit according to the harvesting position, so as to complete the harvesting and positioning of the ripe citrus fruit.
[0162] In some embodiments, after the harvesting robot determines the harvesting position of the ripe citrus fruit based on the corresponding three-dimensional point cloud data and three-dimensional spatial coordinates, and drives its robotic arm to harvest the ripe citrus fruit according to the harvesting position, the process includes:
[0163] Step S401: The pruning robot responds to the citrus tree pruning command and determines the corresponding three-dimensional point cloud data of the citrus tree;
[0164] Step S403: Based on the three-dimensional point cloud data corresponding to the citrus tree, determine the corresponding positions of the citrus branches and the citrus trunk, and prune the citrus tree based on the corresponding positions of the citrus branches to complete the harvesting and positioning of the mature citrus fruit.
[0165] In some embodiments, to verify the effectiveness and robustness of the BEM module, this application applies the RandLA-Net network containing the BEM module to the large public dataset S3DIS and a self-made fruit point cloud dataset. The performance of the model on these datasets is evaluated using three metrics: overall classification accuracy (eval accuracy), mean intersection-over-union ratio (mIoU), and mean classification accuracy (mAcc). The formulas for calculating each metric are as follows:
[0166]
[0167] Here, $Accuracy_{i}$ denotes the accuracy for class $i$, $N$ is the total number of classes, TP is the number of true positives, FP is the number of false positives, and FN is the number of false negatives. In some embodiments, training ran for 100 epochs on a single GeForce RTX 3090 GPU, with each epoch containing 500 gradient updates. The batch size during training was set to 6, and the initial learning rate was 1e-2. The project, RandLA-Net-BEM, is implemented on a Linux system with the TensorFlow 2.60 framework.
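The three evaluation metrics can be computed from a confusion matrix, as in the following sketch. It assumes the commonly used definitions: overall accuracy is the fraction of correctly labeled points, mAcc is the mean of the per-class accuracies $Accuracy_{i}$, and mIoU is the mean over classes of $TP_{i} / (TP_{i} + FP_{i} + FN_{i})$. Taking $Accuracy_{i}$ as the per-class recall is an assumption, and the function name and toy labels are hypothetical.

```python
import numpy as np

def segmentation_metrics(y_true, y_pred, n_classes):
    """Return (overall accuracy, mAcc, mIoU) from per-point labels."""
    conf = np.zeros((n_classes, n_classes), dtype=np.int64)
    np.add.at(conf, (y_true, y_pred), 1)     # rows: ground truth, columns: prediction
    tp = np.diag(conf).astype(float)
    fp = conf.sum(axis=0) - tp
    fn = conf.sum(axis=1) - tp
    overall_acc = tp.sum() / conf.sum()
    # Accuracy_i is taken here as per-class recall (an assumption).
    per_class_acc = tp / np.maximum(tp + fn, 1)
    iou = tp / np.maximum(tp + fp + fn, 1)
    return overall_acc, per_class_acc.mean(), iou.mean()

# Toy labels for three classes.
y_true = np.array([0, 0, 0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 0, 1, 1, 1, 2, 0])
oa, macc, miou = segmentation_metrics(y_true, y_pred, n_classes=3)
```

For these toy labels, six of the eight points are correct, giving an overall accuracy of 0.75. The per-class IoU values are 3/5, 2/3, and 1/2.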
[0168] Please see Figures 4 to 6, in which Figure 4 shows the change in training accuracy during model training, Figure 5 shows the change in the learning rate during model training, and Figure 6 shows the change in the loss value during model training. Analyzing these graphs, the following conclusions can be drawn. The training accuracy (over steps) rises rapidly to a high level (over 90%) within the first 5000 steps; the rate of increase then slows but remains stable, indicating that the model can effectively learn and maintain high accuracy. The training learning rate (over steps) decreases gradually, the result of a learning rate decay strategy in which the learning rate is multiplied by a fixed factor (0.95) after each training iteration until a stable value is reached. Gradually reducing the learning rate helps the model improve quickly in the early stages of training and avoids excessive oscillation through finer adjustments later. The training loss (over steps) decreases rapidly at the beginning, indicating that the model can quickly reduce errors in the early stages. As the number of training steps increases, the rate of decrease slows, but the overall trend remains downward, indicating that the model continues to optimize its parameters.
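The decay strategy can be written out as a short sketch using the initial learning rate of 1e-2 and the factor 0.95 given in the text. The decay interval is an assumption: the sketch applies the factor once per epoch over the 100 epochs, and it does not model the final stable value, which the text does not specify. The function name is hypothetical.

```python
def learning_rate(decay_steps, initial_lr=1e-2, decay=0.95):
    """Learning rate after `decay_steps` applications of the decay factor."""
    return initial_lr * decay ** decay_steps

# One decay step per epoch (assumed) over 100 epochs.
schedule = [learning_rate(k) for k in range(100)]
```

Under this assumption the rate falls from 1e-2 to 9.5e-3 after the first decay step and keeps shrinking geometrically, consistent with the gradually decreasing curve described for Figure 5.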
[0169] In some embodiments, to demonstrate the actual effect of the bilateral enhancement module (BEM) more intuitively, this application compares results from the following four perspectives: the original input data, the label of each point in the image, the segmentation result of the original RandLA-Net network, and the segmentation result of the improved RandLA-Net network. Please refer to Figures 7 to 10, in which Figure 7 is a schematic diagram of the original input image, Figure 8 is a schematic diagram of the label of each point in the image, Figure 9 is a schematic diagram of the segmentation result of the original RandLA-Net network, and Figure 10 is a schematic diagram of the segmentation result of the improved RandLA-Net network. Analysis of the testing and visualization results of the BEM clearly demonstrates its effectiveness in the segmentation task. Comparing the segmentation performance from the four perspectives, the results show that the original RandLA-Net network performs well in identifying fruits not obscured by branches and leaves, but its performance decreases significantly for fruits partially obscured by branches and leaves within the tree canopy. On the S3DIS dataset, the improved model also shows a significantly higher success rate in identifying target boundaries than the original RandLA-Net network. From the performance on both datasets, it is evident that the RandLA-Net network integrating the BEM module (Ours) outperforms the original network in both detail processing and overall segmentation. Therefore, it can be concluded that the bilateral enhancement module proposed in this application plays a significant role in point cloud segmentation tasks, significantly improving the model's accuracy and reliability.
[0170] As can be seen from the above embodiments, this application proposes an improved RandLA-Net network and applies it to the fruit semantic segmentation task. By introducing a bilateral enhancement module, accurate semantic segmentation of ripe fruit, immature fruit, and fruit tree branches and leaves is achieved. Experimental results show that the improved RandLA-Net network exhibits higher accuracy and robustness when processing fruit data.
[0171] This application introduces a bilateral enhancement module (BEM) on the basis of the original RandLA-Net network, effectively improving the network's performance in fruit semantic segmentation tasks. The BEM enhances feature processing through a local spatial encoding (LoCSE) module and an attentive pooling module.
[0172] Experiments conducted on a self-made fruit point cloud dataset show that the improved RandLA-Net network outperforms the original network on multiple evaluation metrics, including mean classification accuracy (mAcc), overall classification accuracy (eval accuracy), and mean intersection-over-union ratio (mIoU). These results demonstrate that adding a bilateral enhancement module can significantly improve the accuracy of fruit semantic segmentation.
[0173] By visualizing the experimental results, we can intuitively see the actual effect of the bilateral enhancement module in the segmentation task. The improved network shows better segmentation effect in the boundary processing of fruits and branches, especially in the case of complex backgrounds, it can better identify mature fruits.
[0174] As can be seen from the above embodiments, compared with the prior art, this application addresses the problems of existing image-based modeling methods being easily affected by factors such as changes in lighting and occlusion, making it difficult to accurately recover the complex structure of fruit trees, and point cloud-based modeling methods, where the noise and sparsity of point clouds may affect the accuracy of modeling. This application has, but is not limited to, the following beneficial effects:
[0175] Firstly, the Neural Radiation Field Model (NeRF) output is used to construct detailed point cloud maps of fruit trees. These point cloud maps not only capture the geometric structure and color information of the fruit trees and fruits, but also reflect the complex texture and branch distribution of the tree, providing accurate 3D data for subsequent agricultural applications such as fruit maturity assessment, automated fruit harvesting, automated pruning, and health monitoring. By combining the colmap algorithm and the NeRF model, this application can achieve accurate 3D reconstruction of fruit trees, and the effectiveness of this method has been verified by experimental results. Furthermore, this method provides a new technical approach for acquiring 3D structural data of fruit trees in a non-invasive manner, and has the potential to be extended to other plants or complex scenarios.
[0176] Secondly, this application proposes an improved RandLA-Net network and applies it to the fruit semantic segmentation task. By introducing a bilateral enhancement module, accurate semantic segmentation of ripe fruit, immature fruit, and fruit tree branches and leaves is achieved. The improved RandLA-Net network exhibits higher accuracy and robustness when processing fruit data.
[0177] Third, this application introduces a bilateral enhancement module (BEM) on the basis of the original RandLA-Net network, which effectively improves the network's performance in fruit semantic segmentation tasks. The bilateral enhancement module (BEM) enhances the effect of feature processing through the Local Spatial Encoding (LoCSE) module and the Attentive Pooling module.
[0178] Fourth, the bilateral enhancement module combines spatial proximity weights with feature similarity weights to generate a comprehensive weight for feature enhancement. This fusion effectively highlights the features of small targets (i.e., fruits) while suppressing background noise. By applying the comprehensive weight, the bilateral enhancement module enhances local details in the fruit region and maintains the clarity of fruit edges, which is crucial for accurate segmentation and for subsequent agricultural operations such as automated harvesting. Specifically, the module uses these weights to adjust the feature representation of each point, thereby enhancing target features and suppressing non-target features and noise. Experiments applying the bilateral enhancement module show that it helps maintain the clarity of fruit edges. By strengthening the distinction between fruits and other plant parts, the module enables the RandLA-Net network to learn and predict the accurate location and extent of mature fruits more effectively, thereby optimizing the overall segmentation performance.
[0179] Furthermore, the citrus fruit harvesting and positioning method of this application, through precise and rapid positioning of mature fruits, lays a solid theoretical foundation for the automated harvesting of mature fruits, greatly saving manpower and material resources.
[0180] Please see Figure 11. In accordance with one of the purposes of this application, a citrus fruit picking and positioning device is provided, comprising a camera pose determination module 1100, a fruit tree point cloud determination module 1200, a semantic segmentation module 1300, and a picking and positioning module 1400. The camera pose determination module 1100 is configured to respond to a citrus fruit picking and positioning command, acquire citrus tree images from various angles, and use a preset colmap algorithm to extract and match feature points from the citrus tree images from each angle to generate camera pose data corresponding to the citrus tree images. The fruit tree point cloud determination module 1200 is configured to use a neural radiation field model trained to convergence to learn an implicit 3D representation of the citrus tree from the citrus tree images from each angle based on the camera pose data, to generate 3D point cloud data corresponding to the citrus tree. The semantic segmentation module 1300 is configured to perform end-to-end semantic segmentation on the 3D point cloud data corresponding to the citrus tree based on a semantic segmentation model trained to convergence. The semantic segmentation model is based on an improved RandLA-Net network, which consists of an input layer, an output layer, five encoder layers, five decoder layers, a bilateral enhancement module, and a bottleneck layer. The bilateral enhancement module is located between the encoder layer and the bottleneck layer. The picking and positioning module 1400 is configured to allow the picking robot to determine the picking position of the mature citrus fruit based on the corresponding three-dimensional point cloud data and three-dimensional spatial coordinates, and drive its robotic arm to pick the mature citrus fruit according to the picking position, thereby completing the picking and positioning of the mature citrus fruit.
[0181] On the basis of any embodiment of this application, and referring to Figure 12, another embodiment of this application also provides an electronic device, which can be implemented by a computer device. Figure 12 is a schematic diagram of the internal structure of the computer device. The computer device includes a processor, a computer-readable storage medium, a memory, and a network interface connected via a system bus. The computer-readable storage medium stores an operating system, a database, and computer-readable instructions. The database may store a sequence of control information. When the computer-readable instructions are executed by the processor, they enable the processor to implement a method for citrus fruit harvesting and positioning. The processor provides computing and control capabilities, supporting the operation of the entire computer device. The memory stores computer-readable instructions which, when executed by the processor, enable the processor to execute the citrus fruit harvesting and positioning method of this application. The network interface of the computer device is used for communication with a terminal. Those skilled in the art will understand that the structure shown in Figure 12 is merely a block diagram of the part of the structure related to the present application and does not constitute a limitation on the computer device to which the present application is applied. Specific computer devices may include more or fewer components than those shown in the figure, combine certain components, or have different component arrangements.
[0182] In this embodiment, the processor is used to execute the specific functions of each module and its sub-modules shown in Figure 11, and the memory stores the program code and various data required to execute these modules or sub-modules. The network interface is used for data transmission between the user terminal and the server. The memory in this embodiment stores the program code and data required to execute all modules and sub-modules of the citrus fruit picking and positioning device of this application, and the server can call its program code and data to execute the functions of all sub-modules.
[0183] This application also provides a storage medium storing computer-readable instructions, which, when executed by one or more processors, cause the one or more processors to perform the steps of the citrus fruit picking and positioning method described in any embodiment of this application.
[0184] This application also provides a computer program product, including a computer program / instructions that, when executed by one or more processors, implement the steps of the citrus fruit picking and positioning method described in any embodiment of this application.
[0185] Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments of this application can be implemented by a computer program instructing related hardware. This computer program can be stored in a computer-readable storage medium, and when executed, it can include the processes of the embodiments of the methods described above. The aforementioned storage medium can be a magnetic disk, optical disk, read-only memory (ROM), or random access memory (RAM), etc.
[0186] The above description covers only some embodiments of this application. It should be noted that those skilled in the art can make several improvements and modifications without departing from the principle of this application, and these improvements and modifications should also be regarded as falling within the scope of protection of this application.
[0187] In summary, the citrus fruit harvesting and positioning method of this application lays a solid theoretical foundation for the automated harvesting of mature fruits through precise and rapid positioning, and greatly saves manpower and material resources.
Claims
1. A method for locating and harvesting citrus fruits, characterized in that it comprises:
in response to a citrus fruit picking and positioning command, acquiring images of a citrus tree from various angles, and calling a preset COLMAP algorithm to extract and match feature points from the citrus tree images from the various angles to generate camera pose data corresponding to the citrus tree images;
using a neural radiance field (NeRF) model trained to convergence, learning an implicit 3D representation of the citrus tree from the citrus tree images from the various angles based on the camera pose data, so as to generate 3D point cloud data corresponding to the citrus tree;
based on a semantic segmentation model trained to convergence, performing end-to-end semantic segmentation on the 3D point cloud data corresponding to the citrus tree to extract the 3D point cloud data and 3D spatial coordinates corresponding to the mature citrus fruits, wherein the basic network architecture of the semantic segmentation model is an improved RandLA-Net network consisting of an input layer, an output layer, 5 encoder layers, 5 decoder layers, a bilateral enhancement module, and a bottleneck layer, the bilateral enhancement module being located between the encoder layers and the bottleneck layer;
determining, by the harvesting robot, the harvesting location of the mature citrus fruit based on the corresponding 3D point cloud data and 3D spatial coordinates, and driving its robotic arm to harvest the mature citrus fruit at the harvesting location, thereby completing the harvesting and positioning of the mature citrus fruit.
2. The method for locating and harvesting citrus fruits according to claim 1, characterized in that the step of learning an implicit 3D representation of the citrus tree from the citrus tree images from the various angles based on the camera pose data, using a neural radiance field (NeRF) model trained to convergence, so as to generate the 3D point cloud data corresponding to the citrus tree, comprises: inputting the citrus tree images from the various angles, together with their corresponding camera pose data, into the neural radiance field model trained to convergence; densely sampling the scene; and using a multilayer perceptron to predict the volume density and view-dependent emitted radiance at each sampled location to generate high-quality images.
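By way of non-limiting illustration, the way per-sample volume density and view-dependent color along one camera ray are turned into a pixel can be sketched as follows. The claim does not recite a rendering equation. The sketch assumes the discrete volume rendering quadrature commonly used with NeRF models, and the function name `composite_ray` and its arguments are hypothetical. The density and color inputs stand in for the multilayer perceptron outputs.

```python
import math
from typing import List, Sequence, Tuple

RGB = Tuple[float, float, float]


def composite_ray(
    densities: Sequence[float],
    colors: Sequence[RGB],
    deltas: Sequence[float],
) -> Tuple[RGB, List[float]]:
    """Alpha-composite samples along one ray, ordered front to back.

    densities: volume density predicted at each sample (assumed MLP output).
    colors:    view-dependent RGB predicted at each sample (assumed MLP output).
    deltas:    distance between consecutive samples along the ray.
    Returns the rendered pixel color and the per-sample weights.
    """
    transmittance = 1.0  # fraction of light not yet absorbed
    weights: List[float] = []
    pixel = [0.0, 0.0, 0.0]
    for sigma, color, delta in zip(densities, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        weight = transmittance * alpha
        weights.append(weight)
        for k in range(3):
            pixel[k] += weight * color[k]
        transmittance *= 1.0 - alpha
    return (pixel[0], pixel[1], pixel[2]), weights
```

Samples with a large weight mark where the ray meets a surface, which is one common way (assumed here, not recited in the claim) to derive 3D points from a trained radiance field.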
3. The method for locating and harvesting citrus fruits according to claim 1, characterized in that the step of performing end-to-end semantic segmentation on the 3D point cloud data corresponding to the citrus tree based on a semantic segmentation model trained to convergence, so as to extract the 3D point cloud data and 3D spatial coordinates corresponding to the mature citrus fruits, comprises: each encoder layer contains a dilated residual block, which processes and downsamples the input features through a shared multilayer perceptron, a local spatial encoding module, and an attention pooling module.
4. The method for locating and harvesting citrus fruits according to claim 3, characterized in that the step of performing end-to-end semantic segmentation on the 3D point cloud data corresponding to the citrus tree based on a semantic segmentation model trained to convergence, so as to extract the 3D point cloud data and 3D spatial coordinates corresponding to the mature citrus fruits, comprises:
in the dilated residual block, obtaining input features of shape $(N, d_{in})$, where $N$ is the number of points and $d_{in}$ is the feature dimension of each point, and converting the input point features into $(N, d_{in}/2)$ through a shared multilayer perceptron;
local spatial encoding: combining the features $(N, d_{in}/2)$ processed by the shared multilayer perceptron with the three-dimensional coordinates $(N, 3)$ of each point to generate features of shape $(N, d_{in})$;
for the features $(N, d_{in})$ output by the local spatial encoding module, first calculating attention scores through the attention pooling module, and then weighting the features with the calculated attention scores to generate new features $(N, d_{in})$;
transforming the input features $(N, d_{in})$ into $(N, d_{in}/2)$ through a shared multilayer perceptron, as the shortcut path;
residual connection and activation function: adding the features of the main path and the shortcut path, and then applying the Leaky ReLU activation function to output the final features $(N, 2d_{in})$.
5. The method for locating and harvesting citrus fruits according to claim 1, characterized in that the step of performing end-to-end semantic segmentation on the 3D point cloud data corresponding to the citrus tree based on a semantic segmentation model trained to convergence, so as to extract the 3D point cloud data and 3D spatial coordinates corresponding to the mature citrus fruits, comprises:
in the bilateral enhancement module, the relative position encoding module calculates the relative position and distance between each point and its neighboring points and uses them as features, the calculation formulas including:
$\mathrm{relative}_{xyz} = \mathrm{xyz}_{tile} - \mathrm{neighbor}_{xyz}$,
$\mathrm{relative}_{feature} = \mathrm{concat}(\mathrm{relative}_{dis}, \mathrm{relative}_{xyz}, \mathrm{xyz}_{tile}, \mathrm{neighbor}_{xyz})$,
where $\mathrm{relative}_{xyz}$ represents the relative position of each point with respect to its neighboring points, $\mathrm{neighbor}_{xyz}$ represents the coordinates of the neighboring points, $\mathrm{xyz}_{tile}$ is a matrix of the same shape as $\mathrm{neighbor}_{xyz}$ obtained by copying $\mathrm{xyz}$, $\mathrm{xyz}$ represents the three-dimensional coordinates of a point, $\mathrm{relative}_{dis}$ is the Euclidean distance between each point and its neighbors, and $\mathrm{relative}_{feature}$ is the combination of the relative distance, the relative position, the original coordinates, and the coordinates of the neighboring points;
in the local feature aggregation module of the bilateral enhancement module, feature fusion and enhancement are achieved by combining relative position features and neighborhood point features, the calculation formulas including:
$f_{concat1} = \mathrm{concat}(f_{neighbours}, f_{xyz1})$,
$f_{pcagg1} = \mathrm{attention}_{pooling}(f_{concat1})$,
$f_{concat2} = \mathrm{concat}(f_{neighbours}, f_{xyz2})$,
$f_{pcagg2} = \mathrm{attention}_{pooling}(f_{concat2})$,
where $f_{concat1}$ is the feature obtained by concatenating the neighborhood point features $f_{neighbours}$ with the relative position features $f_{xyz1}$, and $f_{pcagg1}$ is the feature obtained by weighted aggregation of $f_{concat1}$ through $\mathrm{attention}_{pooling}$;
in the attention pooling module of the bilateral enhancement module, the attention weight of each point with respect to its neighboring points is calculated, and the features of the neighboring points are weighted and averaged to further enhance the feature representation, the calculation formulas including:
$\mathrm{att}_{activation} = \mathrm{dense}(f_{reshaped})$,
$\mathrm{att}_{scores} = \mathrm{softmax}(\mathrm{att}_{activation})$,
$f_{agg} = \sum (f_{reshaped} \cdot \mathrm{att}_{scores})$,
where $\mathrm{att}_{activation}$ is the activation value obtained by applying a fully connected operation to the reshaped feature $f_{reshaped}$, $\mathrm{att}_{scores}$ is the attention weight obtained by applying the softmax activation function to the attention activation value $\mathrm{att}_{activation}$, and $f_{agg}$ is the aggregated feature obtained by weighted averaging of the reshaped feature $f_{reshaped}$ according to the attention weights $\mathrm{att}_{scores}$;
the dilated residual block, combined with the local feature aggregation module and the attention pooling module, enhances and propagates features, the calculation formulas including:
$f_{pc1} = \mathrm{conv2d}(\mathrm{feature})$,
$\mathrm{shortcut} = \mathrm{conv2d}(\mathrm{feature})$,
$\mathrm{output} = \mathrm{leaky}_{relu}(f_{pc2} + \mathrm{shortcut})$,
where $f_{pc1}$ is the intermediate feature obtained by performing a convolution operation on the input features, $f_{pc2}$ is the feature obtained by processing $f_{pc1}$ through the local feature aggregation $\mathrm{building}_{block}$, $\mathrm{shortcut}$ is the shortcut connection feature obtained by performing a convolution operation on the input features, and $\mathrm{output}$ is the final output feature obtained by adding $f_{pc2}$ and $\mathrm{shortcut}$ and applying the leaky ReLU activation function.
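By way of non-limiting illustration, the relative position encoding and the attention pooling recited in claim 5 can be sketched in plain Python as follows. The function names, the neighbor index layout, the bias-free dense layer, and the choice to apply softmax across the neighbors of a point are assumptions made for this sketch. The claim itself does not fix them.

```python
import math
from typing import List, Sequence

Vec = Sequence[float]


def relative_position_encoding(
    xyz: Sequence[Vec], neighbor_idx: Sequence[Sequence[int]]
) -> List[List[List[float]]]:
    """For each point and each neighbor, build the 10-value relative feature:
    concat(relative_dis, relative_xyz, xyz_tile, neighbor_xyz)."""
    features = []
    for i, neighbors in enumerate(neighbor_idx):
        center = xyz[i]  # one row of xyz_tile: the point repeated per neighbor
        rows = []
        for j in neighbors:
            neighbor = xyz[j]
            relative_xyz = [c - n for c, n in zip(center, neighbor)]
            relative_dis = math.sqrt(sum(v * v for v in relative_xyz))
            rows.append([relative_dis, *relative_xyz, *center, *neighbor])
        features.append(rows)
    return features


def attention_pooling(
    f_reshaped: Sequence[Vec], weight: Sequence[Vec]
) -> List[float]:
    """Aggregate K neighbor features (K x d) of one point into one d-vector.

    weight is a d x d matrix for a bias-free dense layer (an assumption).
    """
    k, d = len(f_reshaped), len(f_reshaped[0])
    # att_activation = dense(f_reshaped)
    activation = [
        [sum(row[i] * weight[i][j] for i in range(d)) for j in range(d)]
        for row in f_reshaped
    ]
    aggregated = []
    for j in range(d):
        column = [activation[n][j] for n in range(k)]
        peak = max(column)  # subtract the maximum for numerical stability
        exps = [math.exp(v - peak) for v in column]
        total = sum(exps)
        scores = [e / total for e in exps]  # att_scores over the K neighbors
        # f_agg = sum(f_reshaped * att_scores)
        aggregated.append(sum(f_reshaped[n][j] * scores[n] for n in range(k)))
    return aggregated
```

With an all-zero weight matrix every neighbor receives the same score, so the pooling reduces to a plain average. Trained weights let the network emphasize the most informative neighbors instead.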
6. The method for locating and harvesting citrus fruits according to claim 1, characterized in that the step of performing end-to-end semantic segmentation on the 3D point cloud data corresponding to the citrus tree based on a semantic segmentation model trained to convergence, so as to extract the 3D point cloud data and 3D spatial coordinates corresponding to the mature citrus fruits, comprises: training the improved RandLA-Net network using a predefined training set, where each point in the training set has been labeled as a fruit or a non-fruit part; during training, the improved RandLA-Net network learns to identify and classify different fruit tree components from the local structural features of the point cloud through its random sampling and local feature aggregation mechanisms; after training, inputting the preprocessed fruit tree point cloud data into the improved RandLA-Net network for semantic segmentation; and outputting, by the improved RandLA-Net network, the semantic label of each point, thereby distinguishing between mature citrus fruits, immature citrus fruits, and citrus tree parts.
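By way of non-limiting illustration, turning per-point semantic labels into the 3D point cloud data and 3D spatial coordinates of mature fruit can be sketched as follows. The label ids, the function names, and the use of a centroid as the spatial coordinate are assumptions made for this sketch. A practical system would likely first group the mature-fruit points into individual fruits, which is not shown.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]

# Hypothetical label ids; the actual ids depend on how the training set is annotated.
MATURE_FRUIT, IMMATURE_FRUIT, TREE_PART = 0, 1, 2


def mature_fruit_points(
    points: Sequence[Point], labels: Sequence[int], mature_label: int = MATURE_FRUIT
) -> List[Point]:
    """Keep only the points whose predicted label is the mature-fruit class."""
    return [p for p, label in zip(points, labels) if label == mature_label]


def centroid(points: Sequence[Point]) -> Point:
    """Mean position of a non-empty set of points."""
    if not points:
        raise ValueError("no points to average")
    n = len(points)
    return (
        sum(p[0] for p in points) / n,
        sum(p[1] for p in points) / n,
        sum(p[2] for p in points) / n,
    )
```

For example, `centroid(mature_fruit_points(points, labels))` yields one coordinate for all mature-fruit points in the cloud, which is only meaningful when the cloud contains a single fruit or has already been split per fruit.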
7. The method for locating and harvesting citrus fruits according to claim 6, characterized in that, after the step in which the harvesting robot determines the harvesting location of the mature citrus fruit based on the corresponding 3D point cloud data and 3D spatial coordinates and drives its robotic arm to harvest the mature citrus fruit at the harvesting location, the method further comprises: in response to a citrus tree pruning instruction, determining the 3D point cloud data corresponding to the citrus tree; determining the positions of the citrus branches and trunk based on the 3D point cloud data corresponding to the citrus tree; and pruning the citrus tree based on the positions of the citrus branches, to complete the harvesting and positioning of the mature citrus fruits.
8. A citrus fruit picking and positioning device, characterized in that it comprises:
a camera pose determination module, configured to respond to a citrus fruit picking and positioning command, acquire citrus tree images from various angles, and call a preset COLMAP algorithm to extract and match feature points from the citrus tree images from the various angles to generate camera pose data corresponding to the citrus tree images;
a fruit tree point cloud determination module, configured to use a neural radiance field (NeRF) model trained to convergence to learn an implicit 3D representation of the citrus tree from the citrus tree images from the various angles based on the camera pose data, so as to generate 3D point cloud data corresponding to the citrus tree;
a semantic segmentation module, configured to perform end-to-end semantic segmentation on the 3D point cloud data corresponding to the citrus tree based on a semantic segmentation model trained to convergence, so as to extract the 3D point cloud data and 3D spatial coordinates corresponding to the mature citrus fruits, wherein the basic network architecture of the semantic segmentation model is an improved RandLA-Net network consisting of an input layer, an output layer, 5 encoder layers, 5 decoder layers, a bilateral enhancement module, and a bottleneck layer, the bilateral enhancement module being located between the encoder layers and the bottleneck layer;
a picking and positioning module, configured to have the picking robot determine the picking location of the mature citrus fruit based on the corresponding 3D point cloud data and 3D spatial coordinates, and drive its robotic arm to pick the mature citrus fruit at the picking location, thereby completing the picking and positioning of the mature citrus fruit.
9. An electronic device comprising a central processing unit and a memory, characterized in that the central processing unit is configured to invoke and run a computer program stored in the memory to perform the steps of the method according to any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that it stores, in the form of computer-readable instructions, a computer program implementing the method according to any one of claims 1 to 7, which, when invoked and run by a computer, executes the steps included in the corresponding method.