An intelligent apple sorting system and method based on deep learning visual detection

Introducing a camera-light source-encoder synchronous triggering device and a deep learning model into the apple sorting system achieves accurate synchronous acquisition and multi-dimensional detection of multi-view images. This solves the problems of inaccurate positioning and inconsistent timestamps in the existing technology and improves the accuracy and stability of sorting.

CN122244489A · Pending · Publication Date: 2026-06-19 · HENGMEI COLOR IND (TIANJIN) PRINTING CO LTD +1

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: HENGMEI COLOR IND (TIANJIN) PRINTING CO LTD
Filing Date: 2026-01-23
Publication Date: 2026-06-19


Abstract

This invention relates to the field of image processing technology and discloses an intelligent apple sorting system and method based on deep learning visual detection. The method includes: arranging at least one set of camera-light source-encoder synchronous triggering devices on a conveyor belt; acquiring multi-view images based on encoder pulses; preprocessing the images to obtain standardized input data; inputting the standardized input data into a trained deep learning model; parsing the output of the deep learning model to obtain instance segmentation results, defect segmentation results, appearance parameter results, and grade classification results; estimating the three-dimensional pose of the apples to determine their trajectory information; generating sorting control parameters corresponding to the apples based on the grade classification results, defect distribution information, and trajectory information; and sending the sorting control parameters to the sorting execution mechanism. This application improves image quality and comparability, enhances the accuracy and interpretability of the parsing results, and improves the accuracy of sorting.

Description

Technical Field

[0001] This invention relates to the field of image processing technology, and more specifically, to an intelligent apple sorting system and method based on deep learning visual detection.

Background Technology

[0002] As the fruit industry moves toward large-scale, intensive operation, the quality of grading and sorting directly affects the commodity value and distribution efficiency of apples, a typical cash crop. Traditional manual sorting is inefficient and inconsistent, and cannot meet the demands of modern fruit processing enterprises for high-precision, high-volume, continuous operation. In recent years, machine vision-based sorting technology has gradually been applied to apple sorting. Typical solutions include image acquisition devices, image processing algorithms, and sorting execution mechanisms, and they grade and sort fruit by detecting appearance, color, size, and surface defects.

[0003] However, existing machine vision-based apple sorting solutions still have several shortcomings. Commonly used technologies rely primarily on edge detection and Hough circle detection for localization. While these methods can extract the geometric center of the fruit, their accuracy decreases under strong highlights, occlusion, or changes in fruit posture. Furthermore, the Hough transform is computationally intensive, making it difficult to meet the real-time requirements of high-speed sorting lines. For grading, existing solutions typically introduce improved LeNet-5 convolutional neural networks to enhance recognition accuracy. However, these methods focus mainly on overall grade classification and lack multi-dimensional joint modeling of defect types, defect distribution areas, and appearance parameters. This results in insufficient interpretability and weak integration with robot grasping and control. In addition, existing solutions often employ single-camera or simple multi-angle imaging and lack hardware synchronization triggering and unified timestamp mechanisms based on conveyor belt encoders. This leads to temporal inconsistencies between multi-view images, affecting the reliability of 3D pose estimation and trajectory prediction.

[0004] Therefore, it is necessary to design an intelligent apple sorting system and method based on deep learning visual detection to solve the problems of the existing technology.

Summary of the Invention

[0005] In view of this, the present invention proposes an intelligent sorting system and method for apples based on deep learning visual detection, aiming to solve the problems of low temporal consistency in multi-view synchronous acquisition, low accuracy of 3D pose and trajectory, and lack of closed-loop feedback in sorting control.

[0006] In one aspect, this invention proposes an intelligent apple sorting method based on deep learning visual detection, comprising: arranging at least one set of camera-light source-encoder synchronous triggering devices on the conveyor belt, hardware-triggering multiple cameras based on encoder pulses, acquiring multi-view images of the same apple with a unified timestamp, and preprocessing the multi-view images to obtain standardized input data; inputting the standardized input data into a trained deep learning model and parsing the output of the deep learning model to obtain the instance segmentation results, defect segmentation results, appearance parameter results, and grade classification results corresponding to the apple; estimating the three-dimensional pose of the apple based on the instance segmentation results and key point detection results, and determining the trajectory information of the apple by combining the time information of the conveying process; generating sorting control parameters corresponding to the apple based on the grade classification results, defect distribution information, and trajectory information; and sending the sorting control parameters to the sorting execution mechanism, and storing the sorting results and quality feedback data in the database after sorting is completed.

[0007] Furthermore, arranging at least one set of camera-light source-encoder synchronous triggering devices on the conveyor belt, hardware-triggering multiple cameras based on encoder pulses, and acquiring multi-view images of the same apple with a unified timestamp includes: using the encoder pulse as the hardware trigger reference and setting the same trigger frequency division coefficient and unified timestamp generation rule for each camera, so that the multi-view images of the same apple captured at different camera positions carry the same unified timestamp identifier; and binding the brightness control of the light source to the hardware trigger, so that the light source maintains a constant output during camera exposure and is turned off outside the exposure; wherein the multi-view images of the same apple meet the preset field-of-view overlap ratio and minimum baseline distance.

[0008] Furthermore, preprocessing the multi-view images to obtain standardized input data includes: performing geometric distortion correction, color calibration, and illumination equalization sequentially on each frame, and removing blurry and overexposed frames based on the quality indicators associated with the unified timestamp; and normalizing the size and pixel values of the multi-view images that pass the quality screening according to a fixed view order and a fixed size, so that the output is standardized input data with a unified size, unified channel order, and unified timestamp identifier.

[0009] Furthermore, inputting the standardized input data into the trained deep learning model and parsing the output of the deep learning model includes: adopting a multi-branch structure with a shared feature extraction network, wherein the instance segmentation branch outputs the instance mask and confidence map corresponding to the apple, the defect segmentation branch outputs the defect category probability map, the appearance parameter regression branch outputs the appearance parameter results of color coverage, maximum diameter, and shape index, the grade classification branch outputs the grade classification probability vector, and the key point detection branch outputs the key point heat map or key point coordinates. During parsing, the effective area is defined by the instance mask. The target instance segmentation result is obtained by confidence thresholding and connected component selection. The defect category probability map is thresholded within the instance mask to obtain the defect segmentation result. The appearance parameter result is calculated within the instance mask from pixel statistics and geometric measures. Thresholding and non-maximum suppression are applied to the key point heat map or key point coordinates within the instance mask to generate the key point detection results. The grade classification result is the highest-probability category of the grade classification probability vector. When the grade classification result is inconsistent with the appearance parameter result or the defect segmentation result, the confidence level is adjusted according to the preset consistency rule before the final grade classification result is output.

[0010] Furthermore, for the multi-view output results of the same apple, the parsing also includes: aligning the multi-view results according to the unified timestamp, and assigning a quality weight to each view according to its image quality indicators. The instance masks from each view are fused by quality weighting to obtain the fused instance segmentation result. The defect category probability maps from each view are averaged by quality weighting and thresholded within the fused instance segmentation result to obtain the fused defect segmentation result. The appearance parameter results from each view are combined by quality-weighted statistics to obtain the fused appearance parameter results. The grade classification probability vectors from each view are averaged by quality weighting, and the class with the highest probability is taken as the fused grade classification result. When the quality index of any view is lower than the preset quality index threshold, the output corresponding to that view is removed from the fusion process.
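The quality-weighted fusion of grade probability vectors described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the application: the function name `fuse_grade_probs`, the scalar quality score per view, and the threshold value 0.3 are all hypothetical.

```python
from typing import List, Sequence, Tuple


def fuse_grade_probs(
    views: Sequence[Tuple[Sequence[float], float]],
    min_quality: float = 0.3,  # assumed threshold, not from the source
) -> Tuple[int, List[float]]:
    """Fuse per-view grade probability vectors by quality-weighted averaging.

    views: one (probability vector, quality score) pair per camera view.
    Returns (index of the winning grade, fused probability vector).
    """
    # Views below the quality threshold are removed from the fusion.
    kept = [(p, q) for p, q in views if q >= min_quality]
    if not kept:
        raise ValueError("no view passed the quality threshold")

    total_weight = sum(q for _, q in kept)
    n_classes = len(kept[0][0])
    fused = [
        sum(p[i] * q for p, q in kept) / total_weight for i in range(n_classes)
    ]
    best = max(range(n_classes), key=fused.__getitem__)
    return best, fused


if __name__ == "__main__":
    views = [([0.6, 0.4], 1.0), ([0.2, 0.8], 1.0), ([0.0, 1.0], 0.1)]
    print(fuse_grade_probs(views))  # the third view is excluded by quality
```

The same weighting pattern would apply per pixel to the defect category probability maps; only the vector case is shown here to keep the sketch short.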

[0011] Furthermore, estimating the 3D pose of the apple based on the instance segmentation results and key point detection results includes: defining the effective region by the instance segmentation results, and reading the key point detection results within the effective region to obtain the two-dimensional key points from each view; pairing the 2D key points of the same apple from different views based on the unified timestamp, and performing spatial triangulation using the camera parameters and the inter-camera coordinate transformation matrix to obtain 3D key points; solving the three-dimensional pose parameters from the 3D key points by minimizing the reprojection error, and outputting the position and attitude parameters; and transforming the three-dimensional pose parameters into the conveyor belt coordinate system through a pre-calibrated coordinate transformation matrix as a unified representation of the three-dimensional pose.
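As an illustration of the triangulation step, the sketch below uses the midpoint method: each paired 2D key point is assumed to have already been back-projected into a viewing ray (camera center plus direction) in a common coordinate frame, and the 3D key point is taken as the midpoint of the shortest segment between the two rays. The midpoint method is a simple stand-in chosen here for brevity; the text does not name a specific triangulation algorithm, and the subsequent reprojection-error minimization is not shown.

```python
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def _dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def triangulate_midpoint(o1: Vec3, d1: Vec3, o2: Vec3, d2: Vec3) -> Vec3:
    """Midpoint of the closest points on rays o1 + s*d1 and o2 + t*d2."""
    w = tuple(p - q for p, q in zip(o1, o2))
    a, b, c = _dot(d1, d1), _dot(d1, d2), _dot(d2, d2)
    d, e = _dot(d1, w), _dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        # Parallel rays carry no depth information (baseline too small).
        raise ValueError("rays are (nearly) parallel")
    s = (b * e - c * d) / denom
    t = (a * e - b * d) / denom
    p1 = tuple(o + s * k for o, k in zip(o1, d1))
    p2 = tuple(o + t * k for o, k in zip(o2, d2))
    x, y, z = ((u + v) / 2.0 for u, v in zip(p1, p2))
    return (x, y, z)


if __name__ == "__main__":
    # Two cameras 0.1 m apart, both looking at a point 0.5 m away.
    point = triangulate_midpoint(
        (0.0, 0.0, 0.0), (0.05, 0.0, 0.5),
        (0.1, 0.0, 0.0), (-0.05, 0.0, 0.5),
    )
    print(point)
```

The parallel-ray check reflects why the method requires a minimum baseline distance between cameras: as the baseline shrinks, the rays approach parallel and the depth estimate becomes ill-conditioned.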

[0012] Furthermore, determining the apple's trajectory information by combining the time information of the conveying process includes: time-aligning the 3D poses, sorted by unified timestamp, with the displacement and velocity information provided by the encoder to establish a motion model based on the conveyor belt motion; and recursively estimating the 3D poses to form a position-time series as the trajectory information. When occlusion or frame loss occurs, interpolation is performed based on the valid 3D poses and the motion model within the most recent preset time period, and abnormal 3D poses are removed after consistency detection.

[0013] Furthermore, generating the sorting control parameters corresponding to the apple based on the grade classification results, defect distribution information, and trajectory information includes: determining the target sorting position based on the grade classification results; obtaining the effective contact area by removing the defect segmentation results from the instance segmentation results; selecting the gripping point within the effective contact area in combination with the three-dimensional pose, and calculating the corresponding grasping pose; setting the clamping parameters and safety boundaries based on the appearance parameter results and defect distribution information; calculating the timing parameters based on the trajectory information and verifying the arrival time window; and combining the grasping pose, clamping parameters, timing parameters, and target sorting position into the sorting control parameters. When the arrival time window does not satisfy the preset conditions, a recalculation or skipping strategy is executed before the sorting control parameters are output.

[0014] Furthermore, sending the sorting control parameters to the sorting execution mechanism and storing the sorting results and quality feedback data in the database after sorting is completed includes: sending the sorting control parameters through the communication interface and receiving an execution readiness confirmation; collecting quality feedback data during the sorting process, including grasp success indicators, clamping status parameters, pose tracking residuals, and placement location confirmation; and, after sorting is completed, writing the sorting results and the quality feedback data to the database under the unified timestamp and recording the corresponding sorting control parameter summary. When no execution readiness confirmation is received or a grasp failure is detected, a preset retry or abort strategy is triggered before writing.

[0015] Compared with existing technologies, the advantages of this invention are as follows: By arranging a camera-light source-encoder synchronous triggering device on the conveyor belt and using encoder pulses as a reference, hardware triggering and unified timestamp control of multiple camera positions are achieved, ensuring strict temporal consistency in the acquisition of multi-view images of the same apple. Furthermore, by combining preprocessing techniques such as geometric distortion correction, color calibration, and illumination equalization, standardized input data is generated, improving image quality and comparability at the source. On this basis, the standardized input data is input into a trained deep learning model, which outputs instance segmentation results, defect segmentation results, appearance parameter results, and grade classification results in a single inference pass. With the instance mask as the bounding region, defect detection and appearance parameter calculation are modeled jointly across multiple dimensions, enhancing the accuracy and interpretability of the parsing results. 3D pose estimation is performed based on the instance segmentation results and key point detection results, and trajectory information is obtained by combining the time information of the conveying process, providing a basis for subsequent grasp timing prediction. Sorting control parameters corresponding to the apples are generated from the grade classification results, defect distribution information, and trajectory information, and are sent to the sorting execution mechanism to carry out sorting. After sorting is completed, the sorting results and quality feedback data are stored in the database, forming a closed loop of identification, positioning, control, execution, and feedback, thereby improving the accuracy, stability, and adaptability of sorting while maintaining the cycle time of the high-speed sorting line.

[0016] On the other hand, this application also provides an intelligent apple sorting system based on deep learning visual detection, for applying the above-mentioned intelligent apple sorting method based on deep learning visual detection, including: The acquisition unit includes at least one set of camera-light source-encoder synchronous triggering devices; The preprocessing unit is configured to implement hardware triggering of a multi-camera system based on encoder pulses and acquire multi-view images of the same apple with a unified timestamp, and to preprocess the multi-view images to obtain standardized input data. The parsing unit is configured to input the standardized input data into a trained deep learning model, parse the output of the deep learning model, and obtain instance segmentation results, defect segmentation results, appearance parameter results, and grade classification results corresponding to the apple. The processing unit is configured to estimate the three-dimensional pose of the apple based on the instance segmentation result and the key point detection result, and determine the trajectory information of the apple by combining the time information of the transportation process. The control unit is configured to generate sorting control parameters corresponding to the apples based on the grade classification results, defect distribution information, and trajectory information; the control unit is also configured to send the sorting control parameters to the sorting execution mechanism, and store the sorting results and quality feedback data in the database after sorting is completed.

[0017] It will be understood that the above-mentioned intelligent apple sorting method and system based on deep learning visual detection have the same beneficial effects, which are not repeated here.

Description of the Drawings

[0018] Various other advantages and benefits will become apparent to those skilled in the art upon reading the following detailed description of preferred embodiments. The accompanying drawings are for illustrative purposes only and are not intended to limit the invention. Furthermore, the same reference numerals denote the same parts throughout the drawings. In the drawings: Figure 1 is a flowchart of an intelligent apple sorting method based on deep learning visual detection provided in an embodiment of the present invention; Figure 2 is a functional block diagram of an intelligent apple sorting system based on deep learning visual detection provided in an embodiment of the present invention.

Detailed Implementation

[0019] Exemplary embodiments of the present disclosure will now be described in more detail with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be implemented in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided to enable a more thorough understanding of the present disclosure and to fully convey the scope of the disclosure to those skilled in the art. It should be noted that, unless otherwise specified, embodiments and features in the embodiments of the present invention can be combined with each other. The present invention will now be described in detail with reference to the accompanying drawings and embodiments.

[0020] In existing apple sorting systems, time synchronization discrepancies in multi-view image acquisition lead to accumulated errors in 3D pose estimation. Simultaneously, single-dimensional detection models struggle to jointly optimize instance segmentation, defect localization, and appearance parameters, impacting the accuracy of sorting control parameter generation. For instance, on high-speed sorting lines, when apples pass through a multi-camera array at a linear speed of 1.2 meters per second, the timestamp differences in multi-view images of the same apple due to trigger signal delays from different cameras exceed 10 milliseconds. This results in spatial coordinate errors of over ±3 millimeters for 3D keypoints, further causing trajectory prediction errors. When using single-branch convolutional neural networks to handle multi-dimensional detection tasks, instance masks and defect regions experience pixel-level boundary conflicts, and color coverage calculations are affected by illumination reflection, leading to a deviation of over 15% between the classification results and measured physical parameters.
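As a quick arithmetic check of the figures above, the sketch below computes how far an apple travels on the belt during a given trigger skew. The function name is illustrative. How this displacement propagates into 3D key point error depends on the camera geometry, which the text does not specify, so the sketch stops at the belt displacement.

```python
def belt_displacement_mm(speed_m_per_s: float, skew_ms: float) -> float:
    """Distance the belt (and the apple on it) travels during a trigger skew.

    Units work out directly: 1 m/s equals 1 mm/ms.
    """
    return speed_m_per_s * skew_ms


if __name__ == "__main__":
    # Belt speed 1.2 m/s and a 10 ms timestamp difference between cameras.
    print(belt_displacement_mm(1.2, 10.0))
```

At 1.2 m/s, every millisecond of skew corresponds to 1.2 mm of belt travel, so a 10 ms skew means the two views observe the apple 12 mm apart along the conveying direction.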

[0021] If the above issues are not addressed, 3D pose errors will cause the robotic arm's gripping point to deviate from the actual contact area, resulting in damage to the apple surface or gripping failure. The multi-task coupling defects of the single-dimensional detection model will lead to misjudgment of sorting levels, reducing the accuracy of screening high-value fruits. Inconsistent timestamps will cause trajectory information breaks in multi-machine collaborative scenarios, leading to conflicting actions or timing disorders in the sorting execution mechanism, increasing system failure rate and maintenance costs.

[0022] To this end, as shown in Figure 1, this application proposes an intelligent apple sorting method based on deep learning visual detection, including:
S100: Arranging at least one set of camera-light source-encoder synchronous triggering devices on the conveyor belt, hardware-triggering multiple cameras based on encoder pulses, acquiring multi-view images of the same apple with a unified timestamp, and preprocessing the multi-view images to obtain standardized input data;
S200: Inputting the standardized input data into a trained deep learning model and parsing the output of the deep learning model to obtain the instance segmentation results, defect segmentation results, appearance parameter results, and grade classification results corresponding to the apple;
S300: Estimating the three-dimensional pose of the apple based on the instance segmentation results and key point detection results, and determining the trajectory information of the apple by combining the time information of the conveying process;
S400: Generating sorting control parameters corresponding to the apple based on the grade classification results, defect distribution information, and trajectory information;
S500: Sending the sorting control parameters to the sorting execution mechanism and storing the sorting results and quality feedback data in the database after sorting is completed.

[0023] Specifically, the camera-light source-encoder synchronous triggering device refers to a hardware combination in which encoder pulses trigger the camera and light source to work together. The encoder outputs pulse signals to the camera and light source controllers, synchronizing camera exposure with the light source output and thus achieving temporal consistency in multi-view image acquisition. This device solves the problem of 3D reconstruction errors caused by temporal asynchrony between multi-view images. Unified timestamp acquisition assigns the same timestamp to images of the same apple from different camera views. This is achieved by generating a unified timestamp when the encoder pulse triggers the cameras, ensuring temporal alignment of the multi-view images and avoiding trajectory prediction deviations caused by temporal misalignment. Standardized input data refers to image data in a unified format after preprocessing. Geometric distortion correction, color calibration, and size normalization are used to eliminate the impact of imaging differences on model inference, which improves the generalization of the deep learning model across images from different views. The instance segmentation result refers to the individual apple regions segmented by the deep learning model. A mask prediction network outputs the pixel-level contour of each apple, solving the localization failures of traditional edge detection under highlights or occlusion. The defect segmentation result refers to the pixel-level classification of defect areas on the apple's surface. A semantic segmentation network can be used to identify defect types, enabling accurate assessment of defect distribution and severity. 3D pose estimation refers to calculating the apple's position and orientation in space. Multi-view key point triangulation and reprojection optimization can be used to overcome the sensitivity of single-view pose estimation to changes in fruit posture. The sorting control parameters refer to the set of instructions that drive the actuator to complete the sorting process. These parameters can include the grasping pose, clamping parameters, and timing parameters, ensuring precise matching between the sorting action and the apple's movement trajectory.

[0024] This application achieves multi-dimensional joint modeling of apple appearance, defects, pose and motion trajectory by synchronously triggering multi-view image acquisition through encoder pulses and combining multi-task deep learning models with three-dimensional pose estimation. This solves the problem of insufficient sorting accuracy caused by temporal misalignment, single-view limitation and fragmentation of multi-dimensional information in traditional solutions.

[0025] The working process and principle of this application are as follows: A synchronous triggering device consisting of cameras, light sources, and encoders is arranged on a conveyor belt. Encoder pulses are used to hardware trigger multiple cameras, ensuring that multi-view images of the same apple are acquired with a unified timestamp. These images are preprocessed to obtain standardized input data, which is then input into a pre-trained deep learning model. The model analyzes the input data and outputs apple instance segmentation, defect segmentation, appearance parameters, and grade classification results.

[0026] Based on instance segmentation and keypoint detection results, the 3D pose of the apple is estimated, and its trajectory is determined by combining the time information of the conveying process. Sorting control parameters are generated based on grade classification, defect distribution, and trajectory information, and these parameters are sent to the sorting execution mechanism. After sorting is completed, the sorting results and quality feedback data are stored in the database.

[0027] Understandably, the hardware synchronization triggering and unified timestamp mechanism solve the time consistency problem of multi-view image acquisition and improve the accuracy of 3D pose estimation. The multi-task output of the deep learning model achieves joint optimization of instance segmentation, defect detection, and parameter estimation, avoiding the problem that a single model cannot handle multi-dimensional tasks. The generation of trajectory information takes into account the conveyor belt motion model, improving the accuracy and reliability of sorting control.

[0028] As a preferred embodiment, the solution of this application is specifically implemented as follows: Multiple sets of camera-light source-encoder synchronous triggering devices are installed on the conveyor belt. Each set includes a high-speed industrial camera, an LED structured light source, and a high-precision incremental encoder. The encoder is connected to the conveyor belt drive shaft and generates pulse signals as a trigger reference. The camera is connected to the encoder via a hardware interface to receive the trigger signals. The light source is driven by the camera control signals and is synchronized with the exposure.

[0029] A uniform trigger frequency division factor is set; for example, image acquisition is triggered once every 100 encoder pulses. The cameras use global shutter mode, and the synchronized exposure time is set to 1 millisecond. During image acquisition, encoder pulses trigger multiple cameras to expose simultaneously, generating multi-view images with the same timestamp.

[0030] The acquired raw images undergo geometric distortion correction, color calibration, and illumination equalization. Blurry and overexposed frames are removed, and the qualified images are arranged in a fixed view order and resized to a uniform size, such as 1024×1024 pixels. Pixel values are normalized to the range 0 to 1 to obtain standardized input data.
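The screening and normalization part of this step can be sketched as follows. This is a simplified illustration, assuming 8-bit grayscale frames held as nested lists so that the example stays dependency-free; the saturation level 250, the 5% rejection limit, and all function names are assumptions, and distortion correction, color calibration, illumination equalization, and blur screening are omitted.

```python
from typing import List, Optional

Image = List[List[int]]  # rows of 8-bit grayscale pixels (simplification)


def overexposed_fraction(img: Image, saturation: int = 250) -> float:
    """Fraction of pixels at or above the assumed saturation level."""
    total = sum(len(row) for row in img)
    hot = sum(1 for row in img for p in row if p >= saturation)
    return hot / total


def resize_nearest(img: Image, size: int) -> Image:
    """Nearest-neighbor resize to size x size."""
    h, w = len(img), len(img[0])
    return [
        [img[r * h // size][c * w // size] for c in range(size)]
        for r in range(size)
    ]


def standardize(
    img: Image, size: int = 1024, max_hot: float = 0.05
) -> Optional[List[List[float]]]:
    """Return a size x size frame scaled to [0, 1], or None if rejected."""
    if overexposed_fraction(img) > max_hot:
        return None  # overexposed frame is dropped from the view set
    return [[p / 255.0 for p in row] for row in resize_nearest(img, size)]


if __name__ == "__main__":
    frame = [[0, 51], [102, 204]]
    print(standardize(frame, size=4))
```

In practice an image library would handle resizing and calibration; the point of the sketch is only the order of operations: screen by quality first, then bring every surviving view to the same size and value range.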

[0031] The deep learning model employs a multi-branch structure, including a shared feature extraction network and multiple task branches. The instance segmentation branch outputs a binary mask and a confidence map, the defect segmentation branch outputs a multi-class probability map, the appearance parameter branch regresses color coverage and size parameters, the grade classification branch outputs a probability vector, and the key point detection branch outputs a heat map.

[0032] The model output is then parsed. Instance masks are obtained through thresholding and connected component analysis. Defect segmentation and parameter calculation are performed within these masks. Key points are obtained through non-maximum suppression. The grade classification result is the class with the highest probability, and it is checked for consistency against the other results.
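The grade selection and consistency check can be sketched as below. The grade labels, the diameter and defect-ratio limits, and the confidence penalty are all hypothetical placeholders; the text only states that a preset consistency rule adjusts the confidence when the grade disagrees with the appearance parameters or defect segmentation.

```python
from typing import Sequence, Tuple

GRADES = ("premium", "first", "second", "reject")  # hypothetical labels


def parse_grade(
    probs: Sequence[float],
    max_diameter_mm: float,
    defect_area_ratio: float,
    min_premium_diameter_mm: float = 70.0,   # assumed rule parameter
    max_premium_defect_ratio: float = 0.02,  # assumed rule parameter
    penalty: float = 0.5,                    # assumed confidence penalty
) -> Tuple[str, float]:
    """Pick the highest-probability grade and apply a consistency rule."""
    best = max(range(len(probs)), key=probs.__getitem__)
    grade, confidence = GRADES[best], probs[best]

    # Example rule: a "premium" grade contradicts a small or visibly
    # defective fruit, so its confidence is reduced.
    inconsistent = grade == "premium" and (
        max_diameter_mm < min_premium_diameter_mm
        or defect_area_ratio > max_premium_defect_ratio
    )
    if inconsistent:
        confidence *= penalty
    return grade, confidence


if __name__ == "__main__":
    print(parse_grade([0.7, 0.2, 0.05, 0.05], 80.0, 0.0))
    print(parse_grade([0.7, 0.2, 0.05, 0.05], 80.0, 0.10))
```

A downstream step could route low-confidence apples to re-inspection instead of a grade bin; that policy is outside what the text describes.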

[0033] Based on multi-view keypoints, triangulation is performed to obtain three-dimensional coordinates. The position and orientation of the apple are estimated by minimizing reprojection error. Combined with displacement and velocity information provided by the encoder, a motion model is established to predict the apple's trajectory.
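A minimal sketch of the belt motion model follows, assuming constant belt speed over the short prediction horizon and motion along a single belt axis. Function names and units are illustrative. A dropped frame is represented by a missing position and filled in from the last valid pose and the encoder speed.

```python
from typing import List, Optional, Tuple


def arrival_time_s(
    x0_mm: float, t0_s: float, speed_mm_per_s: float, station_mm: float
) -> float:
    """Time at which an apple seen at x0 at time t0 reaches the station."""
    return t0_s + (station_mm - x0_mm) / speed_mm_per_s


def fill_gaps(
    samples: List[Tuple[float, Optional[float]]], speed_mm_per_s: float
) -> List[Tuple[float, float]]:
    """Build a position-time series, propagating over missing poses.

    samples: (time in s, position in mm or None when the pose is missing).
    Missing poses before the first valid one are skipped.
    """
    track: List[Tuple[float, float]] = []
    last: Optional[Tuple[float, float]] = None
    for t, x in samples:
        if x is None:
            if last is None:
                continue
            x = last[1] + speed_mm_per_s * (t - last[0])
        else:
            last = (t, x)
        track.append((t, x))
    return track


if __name__ == "__main__":
    samples = [(0.00, 100.0), (0.01, None), (0.02, 124.0)]
    print(fill_gaps(samples, speed_mm_per_s=1200.0))
    print(arrival_time_s(100.0, 0.0, 1200.0, 700.0))
```

A recursive filter such as a Kalman filter would be a natural refinement of this constant-velocity model, but the text does not name a specific estimator.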

[0034] The target sorting location is determined based on the classification level. Within the effective contact area, a gripping point is selected, and the gripping posture is calculated. Clamping parameters are set based on appearance parameters and defect distribution, and timing parameters are calculated based on trajectory information. These parameters are combined to form sorting control commands, which are then sent to the actuator.
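The assembly of the control parameters can be sketched as a small data structure plus a time-window check. Everything numeric here is a placeholder: the mapping from diameter to grip force, the force cap, and the minimum lead time are invented for illustration, as are the names `SortingControl` and `build_control`.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class SortingControl:
    target_bin: int                       # sorting position for the grade
    grasp_point_mm: Tuple[float, float, float]
    grip_force_n: float
    grasp_time_s: float


def build_control(
    grade_index: int,
    grasp_point_mm: Tuple[float, float, float],
    max_diameter_mm: float,
    arrival_s: float,
    now_s: float,
    min_lead_s: float = 0.05,  # assumed minimum time the actuator needs
) -> Optional[SortingControl]:
    """Combine the parameters, or return None to skip this apple."""
    # Time-window check: too little lead time means the actuator cannot
    # reach the grasp point in time, so the apple is skipped.
    if arrival_s - now_s < min_lead_s:
        return None
    # Placeholder force law: larger fruit gets more force, up to a cap.
    grip_force_n = min(8.0, 2.0 + 0.05 * max_diameter_mm)
    return SortingControl(grade_index, grasp_point_mm, grip_force_n, arrival_s)


if __name__ == "__main__":
    print(build_control(1, (350.0, 20.0, 40.0), 80.0, arrival_s=1.0, now_s=0.5))
    print(build_control(1, (350.0, 20.0, 40.0), 80.0, arrival_s=0.52, now_s=0.5))
```

Returning `None` models the skipping strategy; a recalculation strategy would instead re-run the trajectory prediction and call the function again with an updated arrival time.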

[0035] During the sorting process, quality feedback data such as grasping success rate and clamping status are collected. After sorting is completed, the sorting results, quality data, and control parameters are written into the database for subsequent analysis and optimization.
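One way to persist these records is sketched below using Python's built-in `sqlite3` module, with the unified timestamp as the key. The table name, column layout, and the choice of SQLite are assumptions made for the example; the text only says that results, feedback data, and control parameters are written to a database.

```python
import json
import sqlite3
from typing import Any, Dict


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the hypothetical sorting log table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sorting_log ("
        "unified_ts TEXT PRIMARY KEY, grade TEXT, grasp_ok INTEGER, "
        "feedback_json TEXT, control_json TEXT)"
    )
    return conn


def record_result(
    conn: sqlite3.Connection,
    unified_ts: str,
    grade: str,
    grasp_ok: bool,
    feedback: Dict[str, Any],
    control: Dict[str, Any],
) -> None:
    """Write one sorting result keyed by its unified timestamp."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO sorting_log VALUES (?, ?, ?, ?, ?)",
            (unified_ts, grade, int(grasp_ok),
             json.dumps(feedback), json.dumps(control)),
        )


if __name__ == "__main__":
    conn = open_store()
    record_result(conn, "163052200_123", "first", True,
                  {"clamp_force_n": 5.5}, {"target_bin": 1})
    print(conn.execute("SELECT * FROM sorting_log").fetchall())
```

Keying on the unified timestamp lets each database row be joined back to the exact multi-view image set that produced the decision.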

[0036] Through the above-described scheme, this application achieves precise synchronous acquisition of multi-view images, improving the accuracy of 3D pose estimation. A multi-task deep learning model enables joint optimization of instance segmentation, defect detection, and parameter estimation, enhancing the accuracy and consistency of grade classification. Motion model-based trajectory prediction improves the precision and reliability of sorting control, reducing grasping failures and fruit damage. The collection and storage of quality feedback data provides a basis for continuous optimization, improving the overall efficiency and quality of intelligent apple sorting.

[0037] This application further proposes arranging at least one set of camera-light source-encoder synchronous triggering devices on a conveyor belt. When hardware triggering is performed on multiple cameras based on encoder pulses and multi-view images of the same apple are acquired with a unified timestamp, the method includes: using the encoder pulse as the hardware triggering reference, setting the same triggering frequency division coefficient and unified timestamp generation rule for each camera, so that the same apple can obtain multi-view images with the same unified timestamp identifier at different camera positions; binding the brightness control of the light source to the hardware triggering, so that the light source maintains a constant output during camera exposure and is turned off outside of exposure; wherein, the multi-view images of the same apple satisfy the preset field of view overlap ratio and minimum baseline distance.

[0038] The trigger frequency division coefficient controls the camera's trigger frequency by adjusting the encoder pulse counting interval, for example, triggering once every ten pulses to ensure that multiple cameras synchronously acquire images within the same time interval. The unified timestamp generation rule combines the absolute count value of the encoder pulses with the microsecond-level timestamp of the system clock to generate a globally unique and alignable time identifier. Light source brightness control directly drives the light source drive circuit via a hardware trigger signal, maintaining constant current during camera exposure and immediately cutting off current after exposure to reduce energy consumption. The field-of-view overlap ratio is adjusted by changing the camera mounting angle and focal length to ensure that the field of view of adjacent cameras covers at least 60% of the apple's surface; the minimum baseline distance is ensured by fixing the physical spacing between cameras, guaranteeing that the optical axis distance between adjacent cameras is greater than one-third of the apple's diameter.

[0039] Specifically, when the encoder pulse serves as the hardware trigger reference, each camera receives the same encoder signal, and the trigger division factor is set to the same value. For example, if the division factor is five, then acquisition is triggered every five pulses. A unified timestamp generation rule concatenates the encoder pulse count with a millisecond-level timestamp from a high-precision clock. For example, if the pulse count is 123 and the system clock reads 16:30:52.200, the generated timestamp is "163052200_123". Light source brightness control is synchronously activated via a trigger signal, maintaining constant brightness during camera exposure. For example, if the exposure time is 500 microseconds, the light source maintains stable brightness for 500 microseconds and is immediately turned off after exposure to reduce ambient light interference. The field-of-view overlap ratio is adjusted by controlling the camera's pitch and horizontal deflection angles to ensure that the apple surface area captured by adjacent cameras covers at least the top and side boundary areas of the same apple. The minimum baseline distance is fixed by the mechanical limiting structure of the mounting bracket, ensuring that the optical axis distance between adjacent cameras is 10 centimeters, so that the baseline distance meets stereo matching requirements during 3D reconstruction. Thus, multi-view images achieve precise alignment in both time and space, providing reliable input for subsequent 3D pose estimation.
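The trigger division and timestamp rule described above can be illustrated with a minimal Python sketch. The function names `should_trigger` and `unified_timestamp` are hypothetical, and the sketch assumes the identifier is the clock time formatted as HHMMSS plus three millisecond digits, followed by an underscore and the encoder pulse count. In a real system the division would be done in trigger hardware rather than in software.

```python
from datetime import time


def should_trigger(pulse_count: int, division: int) -> bool:
    """Return True on every `division`-th encoder pulse (assumed 1-based count)."""
    return pulse_count > 0 and pulse_count % division == 0


def unified_timestamp(clock: time, pulse_count: int) -> str:
    """Concatenate a millisecond-level clock reading with the encoder pulse count.

    Assumed layout: HHMMSS + 3-digit milliseconds + "_" + pulse count.
    """
    millis = clock.microsecond // 1000
    return f"{clock.strftime('%H%M%S')}{millis:03d}_{pulse_count}"


# Every camera applies the same division factor, so all of them fire on the same pulses.
fired = [n for n in range(1, 21) if should_trigger(n, division=5)]

# Every camera also applies the same rule, so one apple gets one shared identifier.
stamp = unified_timestamp(time(9, 15, 7, 45000), pulse_count=480)
```

Because the identifier is derived only from the shared encoder count and clock reading, images from different cameras can be grouped by simple string equality.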

[0040] As a preferred embodiment, the solution of this application is specifically implemented as follows: At least one set of camera-light source-encoder synchronous triggering devices is arranged on the conveyor belt. Encoder pulses are used as the hardware trigger reference, and the same trigger frequency division coefficient and unified timestamp generation rule are set for each camera. For example, the trigger frequency division coefficient can be set to 100, meaning that the camera is triggered to acquire data once every 100 encoder pulses. The unified timestamp generation rule can be a combination of the system time in milliseconds and the encoder count value. This allows the same apple to be captured in multi-view images with the same unified timestamp identifier from different camera positions.

[0041] The brightness control of the light source is bound to hardware triggering. Specifically, the light source maintains a constant output during camera exposure and is turned off outside of exposure. The light source activation time can be synchronized with the camera exposure time by setting the delay parameter of the light source controller. Multi-view images of the same apple meet preset field-of-view overlap ratios and minimum baseline distances. Furthermore, the field-of-view overlap ratio can be set to over 50%, and the minimum baseline distance can be set to 100mm to ensure sufficient information redundancy and parallax between multi-view images.

[0042] Through the above technical solutions, this application achieves precise synchronous triggering and unified timestamp identification for multi-camera setups, ensuring temporal consistency of images of the same apple from different viewpoints. This improves the accuracy of subsequent 3D pose estimation and trajectory prediction. Simultaneously, the synchronized control of light source and camera exposure reduces unnecessary energy consumption and extends the lifespan of the light source. Furthermore, the rationally set field-of-view overlap ratio and baseline distance provide a solid foundation for multi-view information fusion and 3D reconstruction.

[0043] This application further proposes to perform geometric distortion correction, color calibration and illumination equalization on each frame of image in sequence, and to remove blurry and overexposed frames based on quality indicators corresponding to a unified timestamp; and to normalize the size and pixels of the multi-view images that have passed the quality screening according to a fixed view order and a fixed size, and output standardized input data with a unified size, a unified channel order and a unified timestamp identifier.

[0044] Geometric distortion correction uses a pre-calibrated camera intrinsic parameter matrix and distortion coefficients to perform inverse mapping correction on the image, eliminating barrel or pincushion distortion. Color calibration establishes a color conversion matrix by photographing a standard color chart, converting the original image to a standard color gamut space. Illumination equalization uses a multi-scale Retinex algorithm to decompose the image's brightness and reflection layers, and performs histogram matching on the brightness layer to eliminate illumination differences. Quality indicators include image sharpness evaluation function values and pixel brightness distribution statistics, with valid frames selected by setting threshold ranges. Size normalization uses bilinear interpolation to scale the image to a fixed resolution, and pixel normalization linearly maps pixel values to a preset numerical range.

[0045] Specifically, when the multi-view images acquired simultaneously enter the preprocessing stage, each frame is first geometrically corrected to eliminate edge stretching caused by wide-angle lenses, ensuring the geometric authenticity of the apple's outline. Then, a color calibration module adjusts images acquired by different cameras to a unified color gamut, avoiding color misjudgments caused by differences in light source color temperature. Illumination equalization further eliminates brightness differences caused by dynamic exposure or local reflections, enhancing the contrast of defect areas. The quality screening stage automatically removes frames with substandard image quality based on sharpness and exposure metrics, retaining only valid viewpoint data. Finally, size and pixel normalization converts the multi-view images into a unified format, forming a tensor structure that can be directly input into deep learning models. For example, in the illumination equalization step, a multi-scale Gaussian kernel is used to decompose the image reflection layer, preserving defect details while suppressing overexposure in highlight areas, enabling subsequent defect segmentation branches to accurately identify minor flaws. Through this processing flow, multi-view images are standardized in geometric, color, and illumination dimensions, providing high-quality input for multi-branch deep learning models and improving the robustness of instance segmentation and defect detection.

[0046] As a preferred embodiment, the solution of this application is specifically implemented as follows: For each image frame, geometric distortion correction, color calibration, and illumination equalization are performed sequentially. Geometric distortion correction uses a polynomial model-based distortion correction algorithm, color calibration uses a color mapping matrix calibrated by a color chart, and illumination equalization uses an adaptive histogram equalization method. Blurry and overexposed frames are removed based on quality metrics corresponding to a unified timestamp. Quality metrics include image sharpness, contrast, and brightness distribution. Multi-view images that pass the quality screening are then normalized in size and pixel values according to a fixed view order and a fixed size. Size normalization scales the image to 512x512 pixels, and pixel normalization maps pixel values to the [0,1] range. The output is standardized input data with a unified size, unified channel order, and unified timestamp identifier. The unified channel order is RGB three channels, and the unified timestamp identifier uses millisecond-level timestamps.
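The quality screening and pixel normalization steps can be sketched as follows. This is only an illustration under stated assumptions: sharpness is scored with the variance of a 4-neighbour Laplacian, overexposure is measured as the fraction of pixels at or above a brightness level, and the thresholds `min_sharpness`, `max_overexposed`, and `level` are hypothetical values, not taken from the application. Images are plain nested lists of 8-bit grayscale values and must be at least 3x3.

```python
from statistics import pvariance


def laplacian_variance(gray):
    """Sharpness score: variance of the 4-neighbour Laplacian over interior pixels."""
    h, w = len(gray), len(gray[0])
    responses = [
        gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1] + gray[y][x + 1]
        - 4 * gray[y][x]
        for y in range(1, h - 1)
        for x in range(1, w - 1)
    ]
    return pvariance(responses)


def overexposed_fraction(gray, level=250):
    """Fraction of pixels at or above `level` (assumed saturation level)."""
    pixels = [p for row in gray for p in row]
    return sum(p >= level for p in pixels) / len(pixels)


def passes_quality(gray, min_sharpness=50.0, max_overexposed=0.05):
    """Keep a frame only if it is sharp enough and not overexposed."""
    return (laplacian_variance(gray) >= min_sharpness
            and overexposed_fraction(gray) <= max_overexposed)


def normalize_pixels(gray):
    """Linearly map 8-bit pixel values to the [0, 1] range."""
    return [[p / 255.0 for p in row] for row in gray]


# A textured frame passes; a flat, saturated frame is rejected.
textured = [[200 if (x + y) % 2 == 0 else 0 for x in range(4)] for y in range(4)]
saturated = [[255] * 4 for _ in range(4)]
kept = [f for f in (textured, saturated) if passes_quality(f)]
```

Resizing to 512x512 and the distortion, color, and illumination corrections are omitted here; they would normally be delegated to an image processing library.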

[0047] Through the above technical solutions, this application achieves standardized preprocessing of multi-view apple images. Geometric distortion correction, color calibration, and illumination equalization improve image quality and consistency. Low-quality images are eliminated based on quality indicators, ensuring the input quality for subsequent processing. Size normalization and pixel normalization give images from different viewpoints a unified format, facilitating processing by deep learning models. A unified timestamp ensures temporal consistency of multi-view images, providing a reliable foundation for subsequent 3D pose estimation and trajectory prediction. These preprocessing steps enhance the accuracy and robustness of subsequent visual detection and analysis.

[0048] This application further proposes a method for inputting standardized input data into a trained deep learning model. When parsing the output of the deep learning model, the method includes: employing a multi-branch structure with a shared feature extraction network, where the instance segmentation branch outputs an instance mask and confidence map corresponding to the apple, the defect segmentation branch outputs a defect category probability map, the appearance parameter regression branch outputs appearance parameter results such as color coverage, maximum diameter, and shape index, the grading classification branch outputs a grading classification probability vector, and the keypoint detection branch outputs a keypoint heatmap or keypoint coordinates. During parsing, the instance mask defines the effective region, and the target instance segmentation result is obtained through confidence thresholding and connected component selection. The defect category probability map is thresholded within the instance mask to obtain the defect segmentation result, and the appearance parameter results are calculated within the instance mask based on pixel statistics and geometric measures. Within the instance mask, the keypoint heatmap or keypoint coordinates are thresholded and non-maximum suppression is applied to generate keypoint detection results. The grading classification result is taken from the highest probability category of the grading classification probability vector, and when there is inconsistency between the grading classification result and the appearance parameter result or defect segmentation result, the confidence is adjusted according to a preset consistency rule before the final grading classification result is output.

[0049] The shared feature extraction network uses ResNet-50 as the backbone, and achieves multi-task joint training by connecting five branch networks in parallel. The instance segmentation branch uses a Mask R-CNN structure to generate pixel-level masks, and the defect segmentation branch uses a U-Net structure to output a defect probability distribution map. The appearance parameter regression branch maps feature vectors to three parameters—color coverage, maximum diameter, and shape index—through fully connected layers. Color coverage is calculated by converting to the HSV color space and then statistically determining the proportion of red pixels. The maximum diameter is calculated by fitting the major axis of an ellipse to the instance mask boundary points. The shape index is calculated by the ratio of the square of the mask perimeter to its area. The keypoint detection branch uses a stacked hourglass network to generate a heatmap and locates keypoint coordinates using Gaussian kernel peak values. The consistency rule is set as follows: when the difference between the classification result and the appearance parameter result exceeds a threshold, the classification confidence is reduced by 30%; when the classification result conflicts with the defect area proportion, the defect area proportion is used as a weight to adjust the classification probability.
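As an illustration of the appearance parameters described above, the following Python sketch computes color coverage as the proportion of red pixels inside the instance mask after HSV conversion, and the shape index as the squared perimeter divided by the area. The hue window `max_hue_deg` and saturation floor `min_sat` that define "red" are assumptions made for this example, as is the flat-list pixel layout.

```python
import colorsys
import math


def color_coverage(rgb_pixels, mask, max_hue_deg=20.0, min_sat=0.3):
    """Fraction of masked pixels whose HSV hue falls in an assumed red window."""
    inside = 0
    red = 0
    for (r, g, b), keep in zip(rgb_pixels, mask):
        if not keep:
            continue  # pixel lies outside the instance mask
        inside += 1
        h, s, _ = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        hue_deg = h * 360.0
        # Red wraps around the hue circle, so check both ends.
        if s >= min_sat and (hue_deg <= max_hue_deg or hue_deg >= 360.0 - max_hue_deg):
            red += 1
    return red / inside if inside else 0.0


def shape_index(perimeter, area):
    """Squared mask perimeter divided by mask area (4*pi for a perfect circle)."""
    return perimeter ** 2 / area


pixels = [(200, 30, 30), (60, 160, 60), (210, 40, 35), (255, 255, 255)]
mask = [True, True, True, False]
coverage = color_coverage(pixels, mask)
circle_index = shape_index(2 * math.pi * 3.0, math.pi * 9.0)
```

Under this definition a larger shape index indicates a contour that deviates further from a circle.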

[0050] Specifically, standardized input data is processed by a shared feature extraction network to generate a unified feature map, which is then processed synchronously by each branch network. The instance segmentation branch outputs a mask that defines the main apple region, and the defect segmentation branch only performs defect detection within this region to avoid background interference. The appearance parameter regression branch calculates parameters based on the effective pixels within the mask; for example, color coverage only considers the proportion of red pixels within the mask. The key point detection branch filters heatmap peak points within the mask region to eliminate noise interference from non-target areas. When the classification result is inconsistent with the appearance parameter result, for example, if the apple is classified as Grade 1 but the color coverage is below 85%, the classification confidence is reduced according to preset rules. If the confidence falls below a threshold, a manual review mechanism is triggered. Through dual verification of the results from multiple branches in both the spatial and logical domains, the reliability of sorting decisions is effectively improved.

[0051] As a preferred embodiment, the solution of this application is specifically implemented as follows: Standardized input data is fed into a trained deep learning model, and the model's output is analyzed. Specifically, a multi-branch structure with a shared feature extraction network is used. The instance segmentation branch outputs an instance mask and confidence map corresponding to the apple; the defect segmentation branch outputs a defect category probability map; the appearance parameter regression branch outputs appearance parameter results such as color coverage, maximum diameter, and shape index; the grading classification branch outputs a grading classification probability vector; and the keypoint detection branch outputs a keypoint heatmap or keypoint coordinates.

[0052] Furthermore, during parsing, the effective area is defined by an instance mask, and the target instance segmentation result is obtained by filtering with confidence threshold and selecting connected components. The defect category probability map is thresholded within the instance mask to obtain the defect segmentation result, and the appearance parameter result is calculated within the instance mask based on pixel statistics and geometric measures.

[0053] In addition, thresholding and non-maximum suppression are performed on the key point heatmap or key point coordinates within the instance mask to generate key point detection results.

[0054] The classification result is taken from the highest probability category of the classification probability vector. When there is an inconsistency between the classification result and the appearance parameter result or the defect segmentation result, the confidence level is adjusted according to the preset consistency rule before the final classification result is output.

[0055] For example, the deep learning model can employ a ResNet50-based feature extraction backbone network, a Mask R-CNN structure for instance segmentation, a DeepLab v3+ structure for defect segmentation, fully connected layers for appearance parameter regression, a softmax classifier for grade classification, and an HRNet structure for keypoint detection. The confidence threshold for instance masks can be set to 0.5, and the connected component selection can retain the connected regions with the largest area. The threshold for the defect category probability map can be set to 0.3. The threshold for the keypoint heatmap can be set to 0.1, and the radius of non-maximum suppression can be set to 3 pixels. Preset consistency rules may include: when the grade classification result is excellent but defect segmentation results exist, adjust the grade classification result to inferior; when the grade classification result is inferior but the appearance parameter results meet the excellent standard, reduce the confidence of the grade classification result by 20%.
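The example thresholds and consistency rules in this embodiment can be sketched in Python as below. The function name `parse_grade`, the grade labels, and the input layout are hypothetical. The sketch assumes that a defect "exists" when any defect probability inside the instance mask reaches the 0.3 threshold, and, because the text does not say what confidence accompanies a grade overridden to inferior, it simply keeps the original top probability.

```python
DEFECT_THRESHOLD = 0.3  # example threshold for the defect probability map


def parse_grade(grade_probs, defect_probs_in_mask, appearance_is_excellent):
    """Pick the most probable grade, then apply the two example consistency rules."""
    grade = max(grade_probs, key=grade_probs.get)
    confidence = grade_probs[grade]
    has_defect = any(p >= DEFECT_THRESHOLD for p in defect_probs_in_mask)

    if grade == "excellent" and has_defect:
        # Rule 1: an excellent grade with detected defects is adjusted to inferior.
        # Assumption: the confidence value is carried over unchanged.
        grade = "inferior"
    elif grade == "inferior" and appearance_is_excellent:
        # Rule 2: appearance contradicts the grade, so confidence is cut by 20%.
        confidence *= 0.8
    return grade, confidence


overridden = parse_grade({"excellent": 0.7, "inferior": 0.3}, [0.1, 0.45], False)
reduced = parse_grade({"excellent": 0.4, "inferior": 0.6}, [0.1], True)
```

Keeping the rules in a small, separate function makes each grading decision traceable to the rule that changed it.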

[0056] Through the above technical solutions, this application achieves joint modeling and analysis of multi-dimensional features of apples, improving the accuracy and interpretability of classification results. The multi-branch deep learning model can simultaneously output instance segmentation, defect segmentation, appearance parameters, classification levels, and keypoint detection results, avoiding redundant computation from multiple independent models. Instance masks define the effective region, reducing background interference and improving the accuracy of various results. The introduction of pre-defined consistency rules ensures consistency between classification results and other detection results, improving the reliability of classification. Furthermore, keypoint detection results provide a foundation for subsequent 3D pose estimation, which is beneficial for improving the accuracy of sorting operations.

[0057] This application further proposes aligning multi-view results based on a unified timestamp and assigning quality weights to each view according to image quality indicators; fusing instance masks from each view through quality weighted fusion to obtain fused instance segmentation results; performing quality weighted averaging and thresholding on the defect category probability maps of each view within the fused instance segmentation results to obtain fused defect segmentation results; performing weighted statistical fusion of appearance parameter results from each view according to quality weights to obtain fused appearance parameter results; and taking the highest probability category as the fused level classification result after quality weighted averaging of the level classification probability vectors from each view; when the quality indicator of any view is lower than a preset quality indicator threshold, the output corresponding to that view is removed from the fusion process.

[0058] The unified timestamp alignment matches the parsing results from different perspectives using time identifiers, ensuring that multi-view data are based on the same time reference. Quality weight allocation constructs evaluation indicators based on image sharpness, exposure, and noise level, generating weight coefficients through normalization. The weighted fusion process uses a pixel-wise weighted averaging algorithm, retaining the maximum confidence value of overlapping areas when fusing instance masks. Before fusion of the defect probability map, the coordinates of each perspective are mapped to the fusion instance mask space to ensure spatial consistency. Appearance parameter weighting statistics use a linear weighting method; color coverage is taken as the weighted average of coverage from each perspective, and the maximum diameter is taken as the maximum statistical value after weighting. After fusion, the probability distribution of the grade classification probability vector is recalculated using the softmax function. The preset quality index threshold is calibrated experimentally; a rejection mechanism is triggered when the quality index falls below 0.7.

[0059] Specifically, the multi-view analysis results are first time-aligned using a unified timestamp to eliminate temporal discrepancies. The image quality of each viewpoint is quantified by calculating blurriness, exposure uniformity, and signal-to-noise ratio, generating weight values from 0 to 1. During instance mask fusion, the masks from each viewpoint are superimposed according to their weights, and the final mask boundary is generated through a binarization threshold. The defect probability map is weighted and averaged within the fused mask area to eliminate noise interference from low-quality views. When calculating appearance parameters, the shape index is taken as the weighted median of the calculated values from each viewpoint to reduce the impact of outliers. After the grade classification probability vectors are fused, the highest probability category must have a confidence level exceeding 0.8; otherwise, a manual review mechanism is triggered. When a viewpoint fails to meet quality standards due to motion blur or abnormal lighting, the data from that viewpoint is automatically excluded, ensuring that the fusion result is generated only based on valid data. Through these steps, the accuracy and robustness of the multi-view analysis results are improved, providing reliable input for subsequent sorting control.
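The weighted median used for the shape index can be illustrated with a short Python sketch. Several definitions of a weighted median exist; this one assumes the lower weighted median, that is, the smallest value at which the cumulative weight reaches half of the total weight. The sample values are invented for illustration.

```python
def weighted_median(values, weights):
    """Lower weighted median: first sorted value whose cumulative weight
    reaches half of the total weight."""
    pairs = sorted(zip(values, weights))
    half = sum(weights) / 2.0
    running = 0.0
    for value, weight in pairs:
        running += weight
        if running >= half:
            return value
    raise ValueError("values and weights must be non-empty")


# Three views report a shape index; the third is an outlier from a poor view.
shape_indices = [12.9, 13.1, 19.0]
quality_weights = [0.4, 0.4, 0.2]
fused_shape_index = weighted_median(shape_indices, quality_weights)
```

Unlike a weighted mean, the result is always one of the observed values, so a single outlying view with a small weight cannot shift it.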

[0060] As a preferred embodiment, the solution of this application is specifically implemented as follows: For multi-view output results of the same apple, the analysis also includes: aligning the multi-view results according to a unified timestamp and assigning quality weights to each viewpoint according to image quality metrics. The instance masks of each viewpoint are fused with quality weighting to obtain a fused instance segmentation result. Within the fused instance segmentation result, the defect category probability maps of each viewpoint are averaged with quality weighting and thresholded to obtain a fused defect segmentation result. The appearance parameter results of each viewpoint are statistically fused according to the quality weights to obtain a fused appearance parameter result. Finally, the grade classification probability vectors of each viewpoint are averaged with quality weighting, and the class with the highest probability is taken as the fused grade classification result. When the quality metric of any viewpoint is lower than a preset quality metric threshold, the output corresponding to that viewpoint is excluded from the fusion process.

[0061] Specifically, the output results from multiple perspectives are first aligned based on a unified timestamp. For example, a time window can be set, and the multiple perspective results falling within this window can be considered as different perspective outputs of the same apple. Secondly, quality metrics such as sharpness and contrast are calculated for each perspective image, and a quality weight is assigned to each perspective accordingly. These quality weights can be normalized so that the sum of the weights for all perspectives is 1.

[0062] Furthermore, a quality-weighted fusion is performed on the instance masks from each viewpoint. The fusion method can employ weighted averaging or weighted voting, among other approaches. For example, for each pixel location, if that location is marked as foreground more than a threshold number of times in the instance masks across multiple viewpoints, it is then marked as foreground in the fusion result.
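A minimal Python sketch of the weighted voting variant is given below. It assumes that the binary masks from all viewpoints have already been mapped into a common pixel grid, and the parameter `vote_threshold` (the fraction of total weight needed to mark a pixel as foreground) is a hypothetical value.

```python
def fuse_masks(masks, weights, vote_threshold=0.5):
    """Quality-weighted voting over binary masks that share one pixel grid."""
    total = sum(weights)
    height, width = len(masks[0]), len(masks[0][0])
    fused = []
    for y in range(height):
        row = []
        for x in range(width):
            # Weighted share of views that mark this pixel as foreground.
            score = sum(w * m[y][x] for m, w in zip(masks, weights)) / total
            row.append(1 if score >= vote_threshold else 0)
        fused.append(row)
    return fused


view_masks = [
    [[1, 1, 0]],  # view with quality weight 0.5
    [[1, 0, 0]],  # view with quality weight 0.3
    [[1, 1, 1]],  # view with quality weight 0.2
]
fused_mask = fuse_masks(view_masks, [0.5, 0.3, 0.2])
```

Dividing by the total weight keeps the vote threshold meaningful even when the weights have not been normalized to sum to 1.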

[0063] For the defect category probability map, a quality-weighted average is performed within the region determined by the fused instance segmentation result. Specifically, the defect category probability maps from each viewpoint are summed according to their corresponding quality weights, and then the average is obtained by dividing by the sum of the weights. Afterward, a thresholding operation is applied to the averaged probability map to obtain the final fused defect segmentation result.

[0064] The fusion of appearance parameter results can be achieved using a weighted average method. For example, for parameters such as color coverage and maximum diameter, the results from each viewpoint can be multiplied by their corresponding quality weights and then summed to obtain the fused result.

[0065] The fusion of classification results involves performing a quality-weighted average of the classification probability vectors from each perspective. Specifically, the probability vector of each perspective is multiplied by its quality weight, then the weighted probability vectors of all perspectives are summed, and finally, the category with the highest probability is selected as the fused classification result.

[0066] Finally, to ensure the reliability of the fusion results, a preset quality threshold is set. When the quality index of a certain viewpoint is lower than this threshold, the output of that viewpoint does not participate in the fusion process, so as to avoid introducing low-quality data that could negatively impact the final result.
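The fusion of classification probability vectors together with the quality threshold can be sketched in Python as follows. The sketch assumes that each view's quality index is used directly as its weight and that views below the threshold are dropped before weighting; the default of 0.7 follows the example value given earlier in this description. The function name and the sample numbers are hypothetical.

```python
def fuse_grade_probabilities(view_probs, view_quality, quality_threshold=0.7):
    """Quality-weighted average of per-view probability vectors.

    Views whose quality index is below the threshold are excluded.
    Returns (index of the most probable grade, fused probability vector).
    """
    kept = [(p, q) for p, q in zip(view_probs, view_quality)
            if q >= quality_threshold]
    if not kept:
        raise ValueError("no view meets the quality threshold")
    total = sum(q for _, q in kept)
    n_classes = len(kept[0][0])
    fused = [sum(p[i] * q for p, q in kept) / total for i in range(n_classes)]
    best = max(range(n_classes), key=fused.__getitem__)
    return best, fused


probs = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.1, 0.1, 0.8]]
quality = [0.9, 0.8, 0.4]  # the third view is blurred and falls below 0.7
best_grade, fused_probs = fuse_grade_probabilities(probs, quality)
```

Dividing by the sum of the retained weights keeps the fused vector a valid probability distribution after a view has been excluded.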

[0067] Through the above technical solution, this application achieves effective fusion of multi-perspective output results. By introducing quality weights and threshold mechanisms, the accuracy and reliability of the fusion results are improved. Simultaneously, by removing outputs from low-quality perspectives, the interference of abnormal data on the final result is reduced. This fusion method fully utilizes multi-perspective information, effectively improving the overall performance of apple sorting.

[0068] In some of the solutions mentioned above in this application, a method for apple sorting by acquiring and processing multi-view images was proposed. However, in the process of 3D pose estimation, due to the temporal inconsistency of multi-view images, it is difficult to accurately match the 2D key points of different views, which leads to an increase in the 3D key point reconstruction error, thereby affecting the accuracy of the 3D pose parameters and ultimately causing trajectory prediction deviation.

[0069] This application further proposes to estimate the 3D pose of an apple based on instance segmentation results and keypoint detection results, and to determine the trajectory information of the apple by combining the time information of the conveying process. This includes: defining an effective area using instance segmentation results; reading keypoint detection results within the effective area to obtain 2D keypoints from each viewpoint; pairing 2D keypoints of the same apple from each viewpoint according to a unified timestamp, and performing spatial triangulation using camera parameters and the inter-camera coordinate transformation matrix to obtain 3D keypoints; solving for 3D pose parameters based on the 3D keypoints in a way that minimizes reprojection error, and outputting position and attitude parameters; and transforming the 3D pose parameters to the conveyor belt coordinate system using a pre-calibrated coordinate transformation matrix as a unified representation of the 3D pose.

[0070] In this process, when defining the effective area for instance segmentation, background noise and interference points from adjacent apples are filtered out using a mask, retaining only the pixel area of the target apple. During the pairing of 2D key points, a unified timestamp ensures that the key points of the same apple have temporal synchronization from different perspectives, avoiding spatial misalignment caused by conveyor belt movement. Spatial triangulation uses multi-view geometric constraints, utilizing camera intrinsic and extrinsic parameters to map 2D points to 3D space. To minimize reprojection error, an iterative optimization algorithm is used to adjust the coordinates and pose parameters of the 3D key points, minimizing the total error between the 2D points projected to each perspective and the actual detection points. The coordinate transformation matrix is obtained through offline calibration, converting the pose parameters in the camera coordinate system to a coordinate system consistent with the direction of conveyor belt movement and the spatial reference.

[0071] Specifically, the mask generated from instance segmentation limits the keypoint detection range to the target apple area, eliminating interference points from other apples or the background and ensuring the accuracy of keypoint detection. Two-dimensional keypoints from different viewpoints are paired using a unified timestamp to eliminate keypoint misalignment caused by time asynchrony. The paired two-dimensional keypoints are then combined with camera parameters and a coordinate transformation matrix to calculate the three-dimensional keypoint coordinates through multi-view geometric relationships, utilizing spatial triangulation to improve 3D reconstruction accuracy. When solving for pose parameters based on the three-dimensional keypoints, a nonlinear optimization algorithm is used to adjust the pose parameters, minimizing the error between the two-dimensional coordinates projected onto each viewpoint and the detection results, thereby obtaining high-precision position and attitude parameters. Finally, the pose parameters are transformed to the conveyor belt coordinate system using a pre-calibrated coordinate transformation matrix, unifying the pose representation across different viewpoints and facilitating subsequent trajectory information calculation and sorting control parameter generation.

[0072] As a preferred embodiment, the solution of this application is specifically implemented as follows: Based on the instance segmentation results, a valid region is defined. Within this region, keypoint detection results are read to obtain 2D keypoints for each viewpoint. For example, for each viewpoint image, an apple contour mask is obtained using an instance segmentation algorithm, and this mask is used as the valid region. Within this valid region, a keypoint detection algorithm is applied to extract the apple's feature points, such as the 2D coordinates of keypoints like the stem and calyx.

[0073] Two-dimensional keypoints of the same apple from different viewpoints are paired based on a unified timestamp. Specifically, images of the same apple captured from different viewpoints are matched using a unified timestamp identifier. For the matched images, the coordinates of the corresponding two-dimensional keypoints are extracted.

[0074] Spatial triangulation is performed by combining camera parameters and the inter-camera coordinate transformation matrix to obtain 3D keypoints. The camera parameters include intrinsic matrix and distortion coefficients, while the inter-camera coordinate transformation matrix describes the relative positions and poses between different cameras. The spatial triangulation algorithm converts paired 2D keypoints into 3D spatial coordinates.
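The application does not name a particular triangulation algorithm. As one simple possibility, the following Python sketch uses the midpoint method for two views: each paired 2D keypoint is assumed to have already been back-projected into a viewing ray (camera center plus direction) expressed in a common coordinate system using the intrinsic parameters and the inter-camera transformation, and the 3D keypoint is taken as the midpoint of the shortest segment between the two rays. The function name and the sample geometry are hypothetical.

```python
def triangulate_midpoint(c1, d1, c2, d2):
    """Midpoint of the shortest segment between rays c1 + s*d1 and c2 + t*d2."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    w0 = [a - b for a, b in zip(c1, c2)]
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w0), dot(d2, w0)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("rays are parallel; no parallax to triangulate")
    s = (b * e - c * d) / denom  # parameter along the first ray
    t = (a * e - b * d) / denom  # parameter along the second ray
    p1 = [ci + s * di for ci, di in zip(c1, d1)]
    p2 = [ci + t * di for ci, di in zip(c2, d2)]
    return [(u + v) / 2.0 for u, v in zip(p1, p2)]


# Two cameras 100 mm apart (units: meters), both observing the same keypoint.
stem = triangulate_midpoint(
    c1=(0.0, 0.0, 0.0), d1=(0.05, 0.0, 0.5),
    c2=(0.1, 0.0, 0.0), d2=(-0.05, 0.0, 0.5),
)
```

The parallel-ray check shows why a minimum baseline distance matters: as the baseline shrinks, the denominator approaches zero and the depth estimate becomes unstable. A result of this kind would typically serve only as the starting point for the reprojection-error minimization.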

[0075] The 3D pose parameters are solved by minimizing reprojection error based on 3D keypoints, and the position and orientation parameters are output. Specifically, a nonlinear optimization algorithm, such as the Levenberg-Marquardt algorithm, is used to iteratively optimize the 3D pose parameters to minimize the error of reprojecting the 3D keypoints onto the image planes of each viewpoint. The optimized parameters are the 3D position and orientation of the apple.

[0076] The 3D pose parameters are transformed to the conveyor belt coordinate system using a pre-calibrated coordinate transformation matrix, serving as a unified representation of the 3D pose. For example, the transformation matrix from the camera coordinate system to the conveyor belt coordinate system is obtained through hand-eye calibration. The 3D pose parameters obtained in the previous step are then transformed to the conveyor belt coordinate system using this transformation matrix, achieving a unified representation of the acquisition results from different cameras.
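The transformation into the conveyor belt coordinate system can be illustrated for the position component with a 4x4 homogeneous matrix. The matrix values below are invented for the example (a 90 degree rotation about the z axis plus a translation) and do not come from any calibration; the orientation component would be transformed analogously by composing rotations.

```python
def transform_point(T, p):
    """Apply a 4x4 homogeneous transform T (nested lists) to a 3D point p."""
    x, y, z = p
    return tuple(T[i][0] * x + T[i][1] * y + T[i][2] * z + T[i][3]
                 for i in range(3))


# Hypothetical camera-to-conveyor transform obtained from offline calibration.
T_cam_to_belt = [
    [0.0, -1.0, 0.0, 0.2],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, -0.5],
    [0.0, 0.0, 0.0, 1.0],
]

apple_in_camera = (0.05, 0.0, 0.5)
apple_on_belt = transform_point(T_cam_to_belt, apple_in_camera)
```

Once every camera's result is expressed in the belt coordinate system, positions from different cameras can be compared and tracked directly.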

[0077] Through the above technical solution, this application achieves accurate estimation of the 3D pose of an apple. Based on multi-view images and keypoint detection results, spatial triangulation and nonlinear optimization overcome the limitations of single-view estimation, improving the accuracy and robustness of pose estimation. Simultaneously, a unified timestamp mechanism ensures the synchronization of multi-view data, avoiding pose estimation deviations caused by time inconsistencies. Furthermore, the pose parameters are uniformly transformed into the conveyor belt coordinate system, providing a unified spatial reference for subsequent trajectory prediction and sorting control, which is beneficial for improving the coordination and reliability of sorting.

[0078] In some of the solutions mentioned above in this application, a method for determining apple trajectory information based on three-dimensional pose and conveying process time information is proposed. However, when the conveyor belt is running at high speed or apples are occluding each other, three-dimensional pose data may be lost or abnormal, resulting in discontinuous trajectory information or deviation from the actual motion state, which in turn affects the accuracy of sorting control parameters.

[0079] This application further proposes to align the 3D poses sorted by a unified timestamp with the displacement and velocity information provided by the encoder in a time sequence, establish a motion model based on the conveyor belt motion, and recursively estimate the 3D poses to form a position-time series as trajectory information; when occlusion or frame loss occurs, interpolation is performed based on the effective 3D poses and motion model within the most recent preset time period, and abnormal 3D poses are removed after consistency detection.

[0080] In this process, timing alignment matches the timestamps of the 3D pose with the physical motion state of the conveyor belt using the encoder's displacement pulses and velocity signals, ensuring that the position sequence is synchronized with the actual displacement of the conveyor belt. The motion model employs linear or nonlinear dynamic equations, combined with the encoder's real-time velocity, to recursively predict the 3D pose, compensating for trajectory discontinuities caused by image acquisition intervals. Interpolation, in the event of occlusion or frame loss, utilizes historical valid pose data and the motion model to generate pose estimates for missing moments, maintaining trajectory continuity. Consistency detection compares the residual between the current pose and the motion model's predicted value; if the residual exceeds a preset residual threshold, it is considered abnormal and discarded, preventing erroneous data from affecting subsequent sorting control.

[0081] Specifically, during conveyor belt operation, the real-time displacement pulses and velocity information output by the encoder are converted into a time-displacement relationship and matched with the three-dimensional pose marked with a unified timestamp to form a time-aligned position sequence. The motion model is constructed based on the encoder velocity parameters, such as using a uniform speed or uniform acceleration model, to predict the pose at the next moment, and the model parameters are updated using actual three-dimensional pose data. When pose data is missing at a certain moment due to occlusion or image acquisition failure, linear or spline interpolation is performed using the valid pose verified in the previous time period and the motion model to generate the pose estimate for that moment. For the acquired pose data, the deviation from the model prediction is calculated. If the deviation exceeds a preset range, it is judged as abnormal and discarded, and interpolation is triggered to fill the gap. Thus, the trajectory information remains continuous in the time dimension, and abnormal data is effectively filtered, ensuring that the sorting execution mechanism can perform grasping operations based on accurate trajectory parameters.

[0082] As a preferred embodiment, the solution of this application is specifically implemented as follows: The 3D pose data, sorted by a unified timestamp, is time-aligned with the displacement and velocity information provided by the encoder. Specifically, the number of displacement pulses and timestamp information output by the encoder are first obtained and converted into the actual displacement and velocity of the conveyor belt. Then, the 3D pose data is sorted according to a unified timestamp and interpolated to align with the encoder data in the time dimension.
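A minimal sketch of this conversion and alignment is given below. The pulse resolution (`metres_per_pulse`), the sample layout `(time_s, pulse_count)`, and the use of linear interpolation to read the belt displacement at a pose timestamp are assumptions for illustration.

```python
from bisect import bisect_right

def encoder_to_motion(samples, metres_per_pulse):
    """Convert (time_s, pulse_count) samples to (time_s, displacement_m, velocity_m_s).

    The velocity of the first sample is None because no earlier sample exists.
    """
    motion = []
    for i, (t, count) in enumerate(samples):
        displacement = count * metres_per_pulse
        velocity = None
        if i > 0:
            t_prev, count_prev = samples[i - 1]
            velocity = (count - count_prev) * metres_per_pulse / (t - t_prev)
        motion.append((t, displacement, velocity))
    return motion

def displacement_at(motion, t):
    """Linearly interpolate belt displacement at time t (clamped at the ends)."""
    times = [m[0] for m in motion]
    i = bisect_right(times, t)
    if i == 0:
        return motion[0][1]
    if i == len(motion):
        return motion[-1][1]
    t0, d0, _ = motion[i - 1]
    t1, d1, _ = motion[i]
    return d0 + (d1 - d0) * (t - t0) / (t1 - t0)

# Hypothetical encoder: 0.1 mm of belt travel per pulse, sampled every 10 ms.
motion = encoder_to_motion([(0.00, 0), (0.01, 120), (0.02, 240)], metres_per_pulse=0.0001)
belt_pos_at_pose = displacement_at(motion, 0.015)  # pose captured between samples
```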

[0083] Establish a motion model based on the conveyor belt's motion. For example, a uniform linear motion model can be used, assuming the apple's speed on the conveyor belt is the same as the conveyor belt's speed. Alternatively, a second-order motion model considering acceleration can be used to accommodate the non-uniform motion during the conveyor belt's start-up and shutdown processes.

[0084] The three-dimensional pose is recursively estimated to form a position-time series as trajectory information. Specifically, recursive estimation algorithms such as the Kalman filter are used, combined with motion models and observation data, to continuously estimate and predict the three-dimensional position of the apple, generating a trajectory sequence containing the position and corresponding timestamp.
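The recursive estimation can be sketched with a one-dimensional constant-velocity Kalman filter along the conveying direction. This is a simplified illustration: the state is reduced to position and velocity on one axis, the initial velocity is taken from the encoder, and the noise values `q` and `r` and the initial covariance are assumed values, not parameters given in this application.

```python
def kalman_track(measurements, dt, encoder_velocity, q=1e-6, r=1e-4):
    """Constant-velocity Kalman filter on the belt axis.

    measurements: positions in metres, with None for an occluded or lost frame
                  (the first entry must be a valid position).
    Returns a list of (time_s, position_m) forming the position-time series.
    """
    x, v = measurements[0], encoder_velocity        # initial state
    p00, p01, p10, p11 = 1e-3, 0.0, 0.0, 1e-3       # initial covariance (assumed)
    track = [(0.0, x)]
    for k, z in enumerate(measurements[1:], start=1):
        # Prediction with the motion model x' = x + v*dt, v' = v.
        x = x + v * dt
        p00, p01, p10, p11 = (
            p00 + dt * (p10 + p01) + dt * dt * p11 + q,
            p01 + dt * p11,
            p10 + dt * p11,
            p11 + q,
        )
        if z is not None:
            # Update with the observed position (measurement matrix H = [1, 0]).
            s = p00 + r
            k0, k1 = p00 / s, p10 / s
            y = z - x
            x, v = x + k0 * y, v + k1 * y
            p00, p01, p10, p11 = (
                (1 - k0) * p00,
                (1 - k0) * p01,
                p10 - k1 * p00,
                p11 - k1 * p01,
            )
        track.append((k * dt, x))
    return track

# Apple moving at 1.2 m/s, one frame lost (None) at the third sample.
track = kalman_track([0.0, 0.012, None, 0.036, 0.048], dt=0.01, encoder_velocity=1.2)
```

When a frame is lost, only the prediction step runs, so the trajectory stays continuous and the estimate is carried forward by the motion model.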

[0085] When occlusion or frame loss occurs, interpolation is performed based on the effective 3D pose and motion model within the most recent preset time period. For example, a 5-frame time window can be set, and the missing pose can be estimated using methods such as polynomial interpolation or spline interpolation based on the effective pose data and motion model within this window.
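The gap filling can be illustrated as follows, using linear interpolation as the simplest case (polynomial or spline interpolation would replace the interpolation line). The track layout `(time_s, position_m or None)` and the fallback to belt-velocity extrapolation when no later valid sample exists are assumptions of this sketch.

```python
def fill_missing(track, belt_velocity, window=5):
    """Fill None positions from valid samples within `window` frames.

    track: list of (time_s, position_m or None), ordered by time.
    If valid samples exist on both sides, interpolate linearly; if only earlier
    ones exist, extrapolate with the motion model x = x0 + v * (t - t0).
    """
    filled = list(track)
    for i, (t, x) in enumerate(filled):
        if x is not None:
            continue
        earlier = filled[max(0, i - window):i]
        prev = next(((tp, xp) for tp, xp in reversed(earlier) if xp is not None), None)
        later = track[i + 1:i + 1 + window]
        nxt = next(((tn, xn) for tn, xn in later if xn is not None), None)
        if prev is not None and nxt is not None:
            (t0, x0), (t1, x1) = prev, nxt
            estimate = x0 + (x1 - x0) * (t - t0) / (t1 - t0)
        elif prev is not None:
            estimate = prev[1] + belt_velocity * (t - prev[0])
        else:
            continue  # no valid history inside the window; leave the gap
        filled[i] = (t, estimate)
    return filled

raw = [(0.00, 0.0), (0.01, 0.012), (0.02, None), (0.03, 0.036), (0.04, None)]
filled = fill_missing(raw, belt_velocity=1.2)
```

Using a later sample implies waiting for it, so an online system would rely on the extrapolation branch and refine the estimate once the next valid pose arrives.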

[0086] Abnormal 3D poses are removed after a consistency check is performed. Specifically, the displacement and velocity changes between adjacent poses can be calculated; if they exceed preset displacement or velocity thresholds, they are determined to be abnormal poses. Additionally, statistical methods such as Mahalanobis distance can be used to detect whether the pose is consistent with the overall trajectory distribution, removing outliers with significant deviations.
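A minimal sketch of the threshold-based consistency check is shown below. It compares each pose with the position predicted from the last accepted pose and the belt velocity; the threshold value and the assumption that the first sample is valid are illustrative choices. A Mahalanobis-distance test would replace the absolute residual with a covariance-normalised one.

```python
def remove_abnormal(track, belt_velocity, max_residual):
    """Drop poses whose residual against the motion-model prediction is too large.

    track: list of (time_s, position_m); the first sample is assumed valid.
    """
    kept = []
    last = None
    for t, x in track:
        if last is not None:
            predicted = last[1] + belt_velocity * (t - last[0])
            if abs(x - predicted) > max_residual:
                continue  # abnormal pose: discard, keep predicting from `last`
        kept.append((t, x))
        last = (t, x)
    return kept

observed = [(0.00, 0.0), (0.01, 0.012), (0.02, 0.080), (0.03, 0.036)]
clean = remove_abnormal(observed, belt_velocity=1.2, max_residual=0.005)
```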

[0087] Through the above technical solutions, this application can effectively improve the accuracy and robustness of apple trajectory prediction. Aligning the 3D pose with the encoder data in time eliminates the time inconsistency problem in the multi-view image acquisition process and provides a reliable data foundation for subsequent trajectory estimation. A model based on conveyor belt motion describes the motion characteristics of apples during conveying more accurately, improving the accuracy of trajectory prediction. A recursive estimation algorithm enables continuous tracking and prediction of apple positions, meeting the real-time requirements of high-speed sorting lines. Interpolation and consistency detection handle abnormal situations such as occlusion and frame loss, improving the robustness of trajectory estimation. These improvements provide more accurate and reliable trajectory information for subsequent sorting control, helping to improve the overall performance and stability of sorting.

[0088] In some of the solutions described above in this application, the lack of comprehensive consideration of defect distribution, three-dimensional pose, and motion timing during the generation of sorting control parameters can lead to sorting mechanism failures or damage to the fruit. For example, existing solutions do not consider the limitations imposed by defective areas on the selection of gripping points, which may result in gripping the defective parts and damaging the fruit; at the same time, the lack of timing verification combined with trajectory information causes the sorting mechanism to fail to complete the action within the correct time window.

[0089] This application further proposes to determine the target sorting position based on the classification results; to obtain the effective contact area by eliminating the defect segmentation results from the instance segmentation results; to select the gripping point and calculate the corresponding gripping pose within the effective contact area by combining the three-dimensional pose; to set the clamping parameters and safety boundaries based on the appearance parameter results and defect distribution information; to calculate the timing parameters and verify the arrival time window based on the trajectory information; and to combine the gripping pose, clamping parameters, timing parameters and target sorting position into sorting control parameters. When the arrival time window does not meet the preset conditions, a recalculation or skipping strategy is executed before outputting the sorting control parameters.

[0090] The effective contact area is generated through Boolean operations on instance segmentation masks and defect segmentation masks, ensuring that the gripping point is located only in defect-free areas. The gripping pose is calculated based on the attitude angles and position coordinates in the 3D pose parameters, combined with the geometric parameters of the gripper's end effector for inverse kinematics solution. Gripping parameters include the gripping force threshold and the gripper opening. The gripping force threshold is obtained by mapping the maximum diameter and shape index from the appearance parameters through a preset mechanical model. The safety boundary is determined based on the overlap between the minimum bounding rectangle of the defect distribution area and the projection of the gripper contact surface. The timing parameters are calculated by matching the position-time sequence in the trajectory information with the kinematic model of the sorting actuator. The arrival time window verification condition is whether the sorting actuator can reach the designated position at the target time point under maximum acceleration.
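The Boolean mask operation can be sketched as follows on small binary masks. The optional `margin` parameter, which also excludes pixels within a given pixel distance of a defect, is an assumption added to illustrate a safety boundary in its simplest form; it is not the bounding-rectangle overlap rule described above.

```python
def effective_contact_area(instance_mask, defect_mask, margin=0):
    """Return instance AND NOT (defect expanded by `margin` pixels).

    Masks are equally sized lists of rows containing 0/1 values.
    """
    h, w = len(instance_mask), len(instance_mask[0])

    def near_defect(r, c):
        for rr in range(max(0, r - margin), min(h, r + margin + 1)):
            for cc in range(max(0, c - margin), min(w, c + margin + 1)):
                if defect_mask[rr][cc]:
                    return True
        return False

    return [
        [bool(instance_mask[r][c]) and not near_defect(r, c) for c in range(w)]
        for r in range(h)
    ]

instance = [
    [0, 1, 1, 1, 0],
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1],
    [0, 1, 1, 1, 0],
]
defect = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
area_plain = effective_contact_area(instance, defect)
area_safe = effective_contact_area(instance, defect, margin=1)
```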

[0091] Specifically, in the process of generating sorting control parameters, the target sorting position is first determined by matching the grade classification results against a preset sorting position mapping table. The effective contact area is obtained by subtracting the mask area of the defect segmentation result from the mask area of the instance segmentation result, ensuring that the gripping operation avoids defective areas. Based on the 3D pose estimation results, a geometric constraint algorithm selects the region with the smallest curvature within the effective contact area as a candidate gripping point, and the contact point is optimized by taking into account the gripping surface size of the gripper's end effector. When setting the gripping parameters, the gripping force threshold is calculated from the relationship between the maximum diameter of the apple and the skin hardness parameter, and the contact pressure distribution is adjusted according to the shape index. The safety boundary is set by calculating the minimum safe distance between the outward-expanded defect area and the contact surface of the gripper, preventing contact with the edge of the defect during gripping. The timing parameters are generated by feeding the position coordinates from the trajectory information into the kinematic model in timestamp order, which yields the rotation angles and movement speeds of each joint of the sorting actuator. The arrival time window is verified by comparing the inverse kinematics solution with the timestamp difference. When the calculated time difference exceeds the actuator's response threshold, a recalculation strategy is triggered, reselecting the gripping point or adjusting the clamping speed. Through the synergistic effect of these steps, the fruit damage rate is effectively reduced while ensuring a high sorting success rate, and precise synchronization between the sorting action and the conveyor belt movement is achieved.

[0092] As a preferred embodiment, the specific implementation of this application is as follows: During the generation of sorting control parameters, the target sorting position is determined by matching the grade classification results against a preset grade area coding table. The effective contact area is the defect-free area obtained by a Boolean operation on the instance segmentation mask and the defect segmentation mask. Within this area, the curvature of the three-dimensional point cloud is calculated and the three points with the smallest curvature are selected as candidate gripping points. The final gripping point coordinates are then determined by combining the centroid projection position and the gripper opening direction. For the gripping parameters, the gripping force threshold is calculated from the maximum diameter and shape index, and the anti-slip coefficient of the gripper contact surface is set according to the defect distribution density. For the timing parameters, the time window in which the apple arrives at the sorting execution mechanism is predicted from the position-time series in the trajectory information. If the deviation between the predicted time window and the idle time interval of the execution mechanism exceeds the set threshold, the timing parameters are regenerated by trajectory interpolation. If the conditions are still not met after three recalculations, the sorting operation for the current apple is skipped.
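The time-window verification with up to three recalculations can be sketched as below. The callback `predict_arrival(attempt)`, the representation of the idle interval as a start and end time, and the deviation measure (distance of the predicted arrival from the idle interval) are hypothetical simplifications for illustration.

```python
def deviation_from_interval(t, start, end):
    """Distance of time t from the interval [start, end]; 0.0 if inside."""
    if t < start:
        return start - t
    if t > end:
        return t - end
    return 0.0

def verify_arrival(predict_arrival, idle_start, idle_end, threshold, max_recalcs=3):
    """Check the predicted arrival against the actuator's idle interval.

    predict_arrival(attempt) returns the predicted arrival time in seconds;
    attempt 0 is the initial estimate, later attempts are recalculations.
    """
    for attempt in range(max_recalcs + 1):
        arrival = predict_arrival(attempt)
        if deviation_from_interval(arrival, idle_start, idle_end) <= threshold:
            return {"action": "sort", "arrival_time": arrival, "recalculations": attempt}
    return {"action": "skip", "arrival_time": None, "recalculations": max_recalcs}

# Hypothetical predictions that improve with each recalculation.
predictions = [0.95, 0.90, 0.86, 0.80]
decision = verify_arrival(lambda k: predictions[k], idle_start=0.70, idle_end=0.85, threshold=0.02)
skipped = verify_arrival(lambda k: 1.50, idle_start=0.70, idle_end=0.85, threshold=0.02)
```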

[0093] Through the above technical solution, this application effectively solves the problem of coordinated control between multi-dimensional detection results and sorting execution. By integrating grade classification, defect distribution, and motion trajectory information to generate accurate sorting parameters, it avoids grasping failures or fruit damage caused by parameter mismatch. At the same time, the dynamic verification mechanism based on time windows improves the fault tolerance of sorting and ensures the temporal reliability of sorting actions in high-speed continuous operation scenarios.

[0094] This application further proposes a method for sending sorting control parameters to the sorting execution mechanism and storing sorting results and quality feedback data after sorting is completed, including: sending sorting control parameters through a communication interface and receiving execution readiness confirmation; collecting quality feedback data including successful grasping identifier, clamping status parameters, pose tracking residuals and placement position confirmation during sorting execution; writing sorting results and quality feedback data into a database according to a unified timestamp after sorting is completed and recording a summary of sorting control parameters; triggering a preset retry or abort strategy before writing when no execution readiness confirmation is received or a grasping failure is detected.

[0095] The communication interface uses industrial Ethernet or RS-485 protocol to achieve real-time transmission of sorting control parameters. Execution readiness confirmation is completed via hardware handshake signals or software response messages. Quality feedback data is synchronously collected by sensors and a vision inspection device. Successful grasping is triggered by a pressure sensor threshold. Grip status parameters include the gripper opening angle and gripping force curve. Pose tracking residuals are calculated using the coordinate difference between the actual pose and the target pose. Placement position confirmation is jointly determined by the position encoder of the end effector and the photoelectric switch of the target placement area. The database uses timestamps as the primary key index, and the sorting control parameter summary includes hash values of the grasping pose, gripping parameters, and target sorting position. The retry strategy is set to resend the sorting control parameters within a preset time window, and the abort strategy is set to skip the current apple and mark it as an abnormal record.
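A possible form of the sorting control parameter summary is sketched below. The choice of SHA-256, the JSON canonicalisation, and the field names are assumptions; this application only states that the summary includes hash values of the grasping pose, gripping parameters, and target sorting position.

```python
import hashlib
import json

def summarize_control_parameters(params):
    """Return a hex digest of the control parameters, independent of key order."""
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Hypothetical parameter record; the same content in a different key order.
params_a = {
    "grasp_pose": [0.10, 0.20, 0.05, 0.0, 0.0, 1.57],
    "clamp": {"force_n": 4.5, "opening_mm": 82.0},
    "target_bin": 3,
}
params_b = {
    "target_bin": 3,
    "clamp": {"opening_mm": 82.0, "force_n": 4.5},
    "grasp_pose": [0.10, 0.20, 0.05, 0.0, 0.0, 1.57],
}
digest_a = summarize_control_parameters(params_a)
digest_b = summarize_control_parameters(params_b)
digest_changed = summarize_control_parameters({**params_a, "target_bin": 4})
```

Storing a digest rather than the full parameter set keeps each record compact while still allowing a later check that the executed parameters match the generated ones.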

[0096] Specifically, sorting control parameters are transmitted to the actuator via a communication interface. Upon receiving the parameters, the actuator returns a readiness confirmation signal to ensure correct parameter loading. During sorting, gripping status parameters are monitored in real time to ensure the gripper's movements match preset parameters. Pose tracking residuals are used to assess deviations in the robotic arm's motion trajectory, and placement confirmation verifies whether the apples have accurately reached the target area. After sorting, all data is stored with a unified timestamp for easy traceability and analysis. In case of communication failure or gripping failure, a retry strategy prioritizes attempting recovery operations. If the number of retries is exceeded, a termination strategy is implemented to prevent production line congestion. Through this process, the reliability and data integrity of the sorting operation are improved, and robustness under abnormal conditions is enhanced.
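The retry and abort behaviour can be illustrated with the following sketch. The `send` callback, which returns True when a readiness confirmation is received, and the retry limit are hypothetical; real transport code (industrial Ethernet or RS-485) is outside the scope of this example.

```python
def send_with_retry(send, max_retries=2):
    """Send control parameters until readiness is confirmed or retries run out.

    send() is a caller-supplied function returning True on readiness confirmation.
    Returns (status, retries_used), where status is "ready" or "aborted".
    """
    for attempt in range(max_retries + 1):
        if send():
            return "ready", attempt
    # Abort strategy: the caller skips this apple and marks an abnormal record.
    return "aborted", max_retries

# Simulated link that confirms only on the second transmission.
responses = iter([False, True])
status_ok = send_with_retry(lambda: next(responses))
status_fail = send_with_retry(lambda: False)
```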

[0097] As a preferred embodiment, the solution of this application is implemented as follows: Sorting control parameters are transmitted to the six-axis robotic arm controller via an industrial Ethernet interface, using a real-time communication message format. Upon receiving the parameters, the sorting actuator returns a readiness confirmation signal containing a checksum, which triggers the robotic arm startup program via a hardware interrupt. During execution, a force sensor installed at the end of the robotic arm collects clamping pressure data at a frequency of 100Hz. Binocular vision tracks the apple's displacement in real time and calculates the pose tracking residual. An infrared photoelectric switch detects whether the placement position matches the target sorting bin. After sorting, the robotic arm motion trajectory data, clamping pressure curve, pose residual peak value, and placement confirmation signal are associated with a unified timestamp and written in batches to a distributed time-series database using a database transaction mechanism. If a communication link interruption results in no readiness confirmation signal being received, the system automatically switches to a redundant communication channel to resend the parameters. When the clamping pressure data exceeds a preset safety threshold, the robotic arm immediately performs an emergency release action and generates a fault log, resuming operation after manual intervention.

[0098] Through the above technical solutions, this application solves the problems of isolated feedback data and lack of anomaly handling mechanisms in existing sorting systems, realizing end-to-end data traceability and closed-loop control of the sorting process. A unified timestamp mechanism ensures time alignment of data from all sensors, redundant communication and security strategies improve the reliability of sorting actions, and a transaction write mechanism guarantees the integrity and consistency of data storage, thereby reducing the risk of misoperation during the sorting process and improving the response speed to abnormal conditions.

[0099] In the above embodiments, a camera-light source-encoder synchronous triggering device is arranged on the conveyor belt. Using encoder pulses as a reference, hardware triggering and unified timestamp control of multiple cameras are achieved, ensuring strict temporal consistency in the acquisition of multi-view images of the same apple. Combined with preprocessing techniques such as geometric distortion correction, color calibration, and illumination equalization, standardized input data is generated, improving image quality and comparability from the source. Based on this, the standardized input data is input into a trained deep learning model, which simultaneously outputs instance segmentation results, defect segmentation results, appearance parameter results, and grade classification results during a single inference process. Defect detection and appearance parameter classification are performed within the regions defined by the instance masks, and multi-dimensional joint modeling enhances the accuracy and interpretability of the parsing results. 3D pose estimation is performed based on instance segmentation results and key point detection results, and trajectory information is obtained by combining the time information of the conveying process, providing a basis for subsequent grasping timing prediction. Sorting control parameters corresponding to apples are generated according to the grade classification results, defect distribution information and trajectory information, and sent to the sorting execution mechanism to realize sorting. At the same time, after sorting is completed, the sorting results and quality feedback data are stored in the database, realizing a closed-loop optimization process of identification-positioning-control-execution-feedback, thereby improving the accuracy, stability and adaptability of sorting while ensuring the cycle time of the high-speed sorting line.

[0100] In another preferred embodiment based on the above embodiments, as shown in Figure 2, this embodiment provides an intelligent apple sorting system based on deep learning visual detection, used to apply the above-described intelligent apple sorting method based on deep learning visual detection, including: an acquisition unit, which includes at least one set of camera-light source-encoder synchronous triggering devices; a preprocessing unit, configured to implement hardware triggering of the multi-camera system based on encoder pulses and acquire multi-view images of the same apple with a unified timestamp, and to preprocess the multi-view images to obtain standardized input data; a parsing unit, configured to input the standardized input data into a trained deep learning model, parse the output of the deep learning model, and obtain instance segmentation results, defect segmentation results, appearance parameter results, and grade classification results corresponding to the apple; a processing unit, configured to estimate the three-dimensional pose of the apple based on the instance segmentation results and key point detection results, and determine the trajectory information of the apple by combining the time information of the conveying process; and a control unit, configured to generate sorting control parameters corresponding to the apple based on the grade classification results, defect distribution information, and trajectory information. The control unit is also configured to send the sorting control parameters to the sorting execution mechanism and store the sorting results and quality feedback data in the database after sorting is completed.

[0101] Specifically, the encoder in the acquisition unit generates a preset number of pulse signals, synchronously triggering all cameras to complete the exposure operation within a 0.5 millisecond time window, while simultaneously triggering the light source to maintain a constant output of 5000 lumens during the exposure. The preprocessing unit performs bilinear interpolation demosaicing on the received RAW format images and uniformly adjusts the image resolution to 1280×1024 pixels. During the inference process, the shared feature map output from the feature extraction layer of the multi-branch neural network in the parsing unit is input into five parallel branches. The instance segmentation branch uses a U-Net structure to generate a binary mask, and the defect segmentation branch applies a cross-entropy loss function to optimize the classification boundary. When calculating the 3D pose, the processing unit establishes a reprojection error model for each keypoint. When the error exceeds 0.8 pixels, the iterative closest point (ICP) algorithm is activated to optimize the pose. When the control unit generates sorting control parameters, it calculates the lead time of the gripping robot arm based on the conveyor belt speed. When the conveyor speed reaches 1.2 m/s, the lead time is set to 1.3 times the robot arm's response delay time. During system operation, the database module adopts a time-series storage structure. Each apple sorting record contains 64 bytes of metadata and a 512×512 pixel quality feedback image.
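As a worked example of the lead-time rule, the sketch below applies the factor of 1.3 to a response delay and converts the result into the belt travel covered during that lead time. The response delay of 50 ms is a hypothetical value; only the factor 1.3 and the belt speed of 1.2 m/s come from the text above.

```python
def grasp_lead(belt_speed_m_s, response_delay_s, factor=1.3):
    """Return (lead_time_s, belt_travel_m) for the gripping robot arm."""
    lead_time = factor * response_delay_s
    return lead_time, belt_speed_m_s * lead_time

# Hypothetical response delay of 50 ms at the stated belt speed of 1.2 m/s.
lead_time_s, lead_distance_m = grasp_lead(1.2, 0.050)
```

Under these assumed values, the arm would be commanded about 65 ms early, during which the apple travels roughly 78 mm along the belt.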

[0102] As a preferred embodiment, the solution of this application is implemented as follows: The acquisition unit consists of two sets of industrial cameras, a ring LED light source, and an encoder. The industrial cameras are symmetrically mounted on the gantry frames on both sides of the conveyor belt, and the ring light source is integrated into the front end of the camera lens. The encoder is mechanically connected to the conveyor belt drive shaft, and its pulse signal is synchronized to the camera and the light source through a trigger controller. The preprocessing unit is deployed in an edge computing device, which receives the encoder pulse signal through an FPGA module, triggers the dual cameras to acquire images in a synchronous mode with a frequency division coefficient of 2, and writes the encoder count value at the acquisition time as a unified timestamp into the image metadata. The preprocessing process includes a geometric distortion correction module and an illumination equalization module, wherein the geometric distortion parameters are pre-obtained through checkerboard calibration, and the illumination equalization adopts a dynamic compensation algorithm based on Retinex theory.
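The frequency-divided triggering and timestamp writing can be sketched as follows. This is a software illustration of logic that the text assigns to an FPGA module; the camera names and the frame record layout are assumptions.

```python
def triggered_frames(pulse_counts, division=2, cameras=("cam_left", "cam_right")):
    """Emit one frame record per camera on every `division`-th encoder pulse.

    The encoder count at the trigger instant is stored as the unified timestamp,
    so frames from different cameras at the same instant share one identifier.
    """
    frames = []
    for count in pulse_counts:
        if count % division == 0:
            frames.extend({"camera": cam, "timestamp": count} for cam in cameras)
    return frames

frames = triggered_frames(range(1, 7))  # encoder counts 1..6
```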

[0103] The parsing unit is equipped with a deep learning model featuring a multi-task learning architecture. This model includes a ResNet-50 feature extraction layer and five parallel branches: the instance segmentation branch uses a Mask R-CNN structure, the defect segmentation branch uses a U-Net structure, the appearance parameter branch includes a fully connected regression layer, the grade classification branch includes a Softmax classification layer, and the keypoint detection branch uses a heatmap prediction structure. Model inference results are output via the NVIDIA Triton inference server. The instance segmentation and keypoint detection results are post-processed using the OpenCV library to generate mask data with instance IDs and 3D coordinate data.

[0104] The processing unit runs a motion estimation algorithm, solving for 3D pose parameters using the EPnP algorithm, and then uses a Kalman filter to fit the pose data at continuous timestamps. The control unit communicates with the six-axis robotic arm controller via the OPC UA protocol, converting the gripping point coordinates into joint angle commands, and simultaneously generating the servo motor trigger timing based on the trajectory prediction results. The database adopts a time-series database structure, storing quality feedback data and equipment status logs from the sorting process.

[0105] Through the above technical solutions, this application solves the problem of asynchronous time acquisition in multi-view image acquisition. A hardware synchronization mechanism triggered by encoder pulses ensures the time alignment accuracy of multi-view data, eliminating displacement deviations during imaging of moving objects. A multi-branch deep learning model is employed to achieve joint inference of appearance parameters, defect distribution, and grade classification, improving the dimensional integrity and spatial consistency of the detection results. A trajectory prediction method based on 3D pose estimation and motion models effectively compensates for positioning errors caused by conveyor belt vibration, ensuring precise matching between the robotic arm's grasping action and the apple's movement trajectory. Optimized database storage structure design enables real-time archiving and traceability of sorting process data, providing complete data support for equipment status monitoring and algorithm optimization.

[0106] Those skilled in the art will understand that embodiments of this application can be provided as methods, systems, or computer program products. Therefore, this application can take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, this application can take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

[0107] This application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of this application. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, produce a device for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.

[0108] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means, which implement the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.

[0109] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions that execute on the computer or other programmable equipment provide steps for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.

[0110] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention and not to limit it. Although the present invention has been described in detail with reference to the above embodiments, those skilled in the art should understand that modifications or equivalent substitutions can still be made to the specific implementation of the present invention. Any modifications or equivalent substitutions that do not depart from the spirit and scope of the present invention should be covered within the scope of protection of the claims of the present invention.

Claims

1. A method for intelligent apple sorting based on deep learning visual detection, characterized in that it comprises: arranging at least one set of camera-light source-encoder synchronous triggering devices on the conveyor belt, hardware-triggering multiple cameras based on encoder pulses, and acquiring multi-view images of the same apple with a unified timestamp; preprocessing the multi-view images to obtain standardized input data; inputting the standardized input data into a trained deep learning model, and parsing the output of the deep learning model to obtain instance segmentation results, defect segmentation results, appearance parameter results, and grade classification results corresponding to the apple; estimating the three-dimensional pose of the apple based on the instance segmentation results and key point detection results, and determining the trajectory information of the apple by combining the time information of the conveying process; generating sorting control parameters corresponding to the apple based on the grade classification results, defect distribution information, and trajectory information; and sending the sorting control parameters to the sorting execution mechanism, and storing the sorting results and quality feedback data in the database after sorting is completed.

2. The intelligent apple sorting method based on deep learning visual detection according to claim 1, characterized in that arranging at least one set of camera-light source-encoder synchronous triggering devices on the conveyor belt, hardware-triggering multiple cameras based on encoder pulses, and acquiring multi-view images of the same apple with a unified timestamp comprises: using the encoder pulse as the hardware trigger reference, and setting the same trigger frequency division coefficient and the same unified timestamp generation rule for each camera, so that the multi-view images of the same apple captured from different camera positions carry the same unified timestamp identifier; and binding the brightness control of the light source to the hardware trigger, so that the light source maintains a constant output during camera exposure and is turned off outside the exposure; wherein the multi-view images of the same apple satisfy a preset field-of-view overlap ratio and a minimum baseline distance.
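The shared frequency division coefficient and unified timestamp rule of claim 2 can be illustrated with a small software model. This is only a sketch under assumptions: in the claimed system the trigger is generated in hardware, and the names `TriggerEvent`, `trigger_events`, and `divider`, as well as the rule "unified timestamp = pulse count divided by the divider", are hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TriggerEvent:
    camera_id: str
    pulse_count: int  # encoder pulse index at which the trigger fired
    unified_ts: int   # identifier shared by all cameras for this trigger


def trigger_events(pulse_count, camera_ids, divider):
    """Model one encoder pulse: return one event per camera if it fires a trigger.

    Assumed rule (not from the source): every `divider`-th pulse triggers all
    cameras at once, and the unified timestamp is the trigger index.
    """
    if divider <= 0:
        raise ValueError("divider must be positive")
    if pulse_count % divider != 0:
        return []  # this pulse is divided away; no camera is triggered
    unified_ts = pulse_count // divider
    return [TriggerEvent(cam, pulse_count, unified_ts) for cam in camera_ids]
```

Because every camera derives its identifier from the same pulse count and the same divider, images of one apple taken from different positions can later be matched by `unified_ts` alone.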

3. The intelligent apple sorting method based on deep learning visual detection according to claim 2, characterized in that preprocessing the multi-view images to obtain standardized input data comprises: performing geometric distortion correction, color calibration, and illumination equalization on each image frame in sequence, and removing blurry and overexposed frames based on quality indicators associated with the unified timestamp; and normalizing the size and pixel values of the multi-view images that pass the quality screening according to a fixed view order and a fixed size, and outputting standardized input data with a unified size, a unified channel order, and a unified timestamp identifier.
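The quality screening step of claim 3 might look like the following minimal sketch. The claim does not specify the quality indicators, so the gradient-based sharpness measure, the saturation-based overexposure measure, the threshold values, and the use of alphabetical view names as the "fixed view order" are all assumptions.

```python
def sharpness(gray):
    """Mean squared intensity difference to the right and lower neighbours.

    A simple focus measure (an assumption); blurry frames score low.
    """
    h, w = len(gray), len(gray[0])
    total, n = 0.0, 0
    for y in range(h - 1):
        for x in range(w - 1):
            dx = gray[y][x + 1] - gray[y][x]
            dy = gray[y + 1][x] - gray[y][x]
            total += dx * dx + dy * dy
            n += 1
    return total / n if n else 0.0


def overexposed_fraction(gray, saturation=250):
    """Fraction of pixels at or above the assumed saturation level."""
    pixels = [p for row in gray for p in row]
    return sum(p >= saturation for p in pixels) / len(pixels)


def screen_views(views, min_sharpness=50.0, max_overexposed=0.05):
    """Return the names of views that pass screening, in a fixed (sorted) order.

    `views` maps a view name to a grayscale image given as rows of 0..255 values.
    """
    kept = []
    for name in sorted(views):
        gray = views[name]
        if (sharpness(gray) >= min_sharpness
                and overexposed_fraction(gray) <= max_overexposed):
            kept.append(name)
    return kept
```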

4. The intelligent apple sorting method based on deep learning visual detection according to claim 1, characterized in that inputting the standardized input data into a trained deep learning model and parsing the output of the deep learning model comprises: adopting a multi-branch structure with a shared feature extraction network, wherein the instance segmentation branch outputs the instance mask and confidence map corresponding to the apple, the defect segmentation branch outputs the defect category probability map, the appearance parameter regression branch outputs the appearance parameter results of color coverage, maximum diameter, and shape index, the grade classification branch outputs the grade classification probability vector, and the key point detection branch outputs the key point heat map or key point coordinates; during parsing, defining the effective area by the instance mask; obtaining the target instance segmentation result by confidence threshold filtering and connected component selection; thresholding the defect category probability map within the instance mask to obtain the defect segmentation result; calculating the appearance parameter results within the instance mask based on pixel statistics and geometric measures; applying thresholding and non-maximum suppression to the key point heat map or key point coordinates within the instance mask to generate the key point detection results; and taking the highest-probability category of the grade classification probability vector as the grade classification result, wherein, when the grade classification result is inconsistent with the appearance parameter results or the defect segmentation result, the confidence level is adjusted according to a preset consistency rule before the final grade classification result is output.
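The claim leaves the "preset consistency rule" open. One possible reading is sketched below: each grade has limits on diameter and defect area, and the probability of any grade that contradicts the measured values is scaled down before the final grade is chosen. The function name `final_grade`, the rule table format, and the `penalty` factor are hypothetical.

```python
def final_grade(probs, max_diameter_mm, defect_area_ratio, rules, penalty=0.5):
    """Pick a grade after penalising grades that contradict the measurements.

    probs: dict grade -> probability from the grade classification branch.
    rules: dict grade -> (min_diameter_mm, max_defect_area_ratio), an assumed
           form of the preset consistency rule.
    Returns (grade, adjusted_confidence).
    """
    adjusted = {}
    for grade, p in probs.items():
        min_diameter, max_defect = rules[grade]
        consistent = (max_diameter_mm >= min_diameter
                      and defect_area_ratio <= max_defect)
        adjusted[grade] = p if consistent else p * penalty
    total = sum(adjusted.values())
    if total <= 0:
        raise ValueError("all grade probabilities are zero")
    adjusted = {g: p / total for g, p in adjusted.items()}  # renormalise
    best = max(adjusted, key=adjusted.get)
    return best, adjusted[best]
```

For example, an apple of 70 mm diameter with a 1% defect area that the classifier rates as "premium" with probability 0.55 would be downgraded to "standard" if the assumed rule requires at least 75 mm and no defects for "premium".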

5. The intelligent apple sorting method based on deep learning visual detection according to claim 4, characterized in that, for the multi-view output results of the same apple, the parsing further comprises: aligning the multi-view results according to the unified timestamp, and assigning a quality weight to each view according to image quality indicators; fusing the instance masks from each view by quality-weighted fusion to obtain the fused instance segmentation result; averaging the defect category probability maps from each view with the quality weights and thresholding the average within the fused instance segmentation result to obtain the fused defect segmentation result; combining the appearance parameter results from each view by quality-weighted statistics to obtain the fused appearance parameter results; and averaging the grade classification probability vectors from each view with the quality weights and taking the category with the highest probability as the fused grade classification result; wherein, when the quality indicator of any view is lower than a preset quality indicator threshold, the output corresponding to that view is excluded from the fusion.
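The quality-weighted averaging of the grade classification probability vectors, including the exclusion of low-quality views, can be sketched as follows. The function name `fuse_grade_probs`, the quality scale of 0 to 1, and the threshold value are assumptions for illustration.

```python
def fuse_grade_probs(view_probs, view_quality, min_quality=0.3):
    """Quality-weighted average of per-view grade probability vectors.

    view_probs:   dict view name -> list of grade probabilities.
    view_quality: dict view name -> quality weight (assumed to lie in 0..1).
    Views whose quality is below `min_quality` are excluded from the fusion.
    Returns (fused_vector, index_of_best_grade).
    """
    kept = [v for v in view_probs if view_quality[v] >= min_quality]
    if not kept:
        raise ValueError("no view passed the quality threshold")
    weight_sum = sum(view_quality[v] for v in kept)
    n_grades = len(view_probs[kept[0]])
    fused = [
        sum(view_quality[v] * view_probs[v][i] for v in kept) / weight_sum
        for i in range(n_grades)
    ]
    best = max(range(n_grades), key=fused.__getitem__)
    return fused, best
```

The same weighted-average pattern applies pixel by pixel to the defect category probability maps before they are thresholded within the fused instance mask.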

6. The intelligent apple sorting method based on deep learning visual detection according to claim 1, characterized in that estimating the three-dimensional pose of the apple based on the instance segmentation results and key point detection results comprises: defining the effective region by the instance segmentation results, and reading the key point detection results within the effective region to obtain the two-dimensional key points of each view; pairing the two-dimensional key points of the same apple from different views based on the unified timestamp, and performing spatial triangulation in combination with the camera parameters and the inter-camera coordinate transformation matrix to obtain three-dimensional key points; solving the three-dimensional pose parameters from the three-dimensional key points by minimizing the reprojection error, and outputting the position and attitude parameters; and transforming the three-dimensional pose parameters into the conveyor belt coordinate system through a pre-calibrated coordinate transformation matrix as the unified representation of the three-dimensional pose.
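The claim does not name a triangulation algorithm. As one simple illustration, the midpoint method below recovers a 3D key point from two viewing rays. It assumes that each paired 2D key point has already been back-projected, using the camera parameters and the inter-camera transformation, into a ray (camera centre plus direction) in a common coordinate frame. The function names are hypothetical.

```python
def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def triangulate_midpoint(c1, d1, c2, d2):
    """Midpoint triangulation of two rays c1 + s*d1 and c2 + t*d2.

    Finds the closest points on the two rays and returns their midpoint.
    This is a swapped-in textbook method, not necessarily the patent's.
    """
    w = [p - q for p, q in zip(c1, c2)]
    a, b, c = _dot(d1, d1), _dot(d1, d2), _dot(d2, d2)
    d, e = _dot(d1, w), _dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("rays are parallel; triangulation is undefined")
    s = (b * e - c * d) / denom  # parameter of the closest point on ray 1
    t = (a * e - b * d) / denom  # parameter of the closest point on ray 2
    p1 = [ci + s * di for ci, di in zip(c1, d1)]
    p2 = [ci + t * di for ci, di in zip(c2, d2)]
    return [(u + v) / 2.0 for u, v in zip(p1, p2)]
```

With noisy key points the two rays no longer intersect, and the distance between the two closest points gives a rough residual that could feed the later consistency checks.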

7. The intelligent apple sorting method based on deep learning visual detection according to claim 6, characterized in that determining the trajectory information of the apple in combination with the time information of the conveying process comprises: time-aligning the three-dimensional poses, sorted by unified timestamp, with the displacement and velocity information provided by the encoder to establish a motion model based on the conveyor belt motion; recursively estimating the three-dimensional poses to form a position-time series as the trajectory information; and, when occlusion or frame loss occurs, interpolating based on the valid three-dimensional poses within the most recent preset time period and the motion model, and removing abnormal three-dimensional poses after consistency detection.
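A heavily simplified version of this trajectory step is sketched below for the belt direction only. It assumes a constant belt speed derived from the encoder, fills missing observations with the motion-model prediction, and replaces observations that deviate too far from the prediction. The function name, the one-dimensional state, and the `max_residual_mm` threshold are assumptions, and the sketch trusts the first valid observation.

```python
def build_trajectory(observations, belt_speed_mm_s, max_residual_mm=5.0):
    """Form a position-time series along the belt direction.

    observations: list of (timestamp_s, x_mm or None), sorted by time;
                  None marks an occluded or dropped frame.
    Missing or inconsistent values are replaced by the constant-velocity
    prediction from the previous trajectory point.
    """
    trajectory = []
    for ts, x in observations:
        predicted = None
        if trajectory:
            last_ts, last_x = trajectory[-1]
            predicted = last_x + belt_speed_mm_s * (ts - last_ts)
        if predicted is None:
            if x is None:
                continue  # nothing to extrapolate from yet
        elif x is None or abs(x - predicted) > max_residual_mm:
            x = predicted  # fill the gap or discard the abnormal pose
        trajectory.append((ts, x))
    return trajectory
```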

8. The intelligent apple sorting method based on deep learning visual detection according to claim 1, characterized in that generating sorting control parameters corresponding to the apple based on the grade classification results, defect distribution information, and trajectory information comprises: determining the target sorting position based on the grade classification results; obtaining the effective contact area by removing the defect segmentation results from the instance segmentation results; selecting the gripping point within the effective contact area in combination with the three-dimensional pose and calculating the corresponding grasping pose; setting the clamping parameters and safety boundaries based on the appearance parameter results and defect distribution information; calculating the timing parameters based on the trajectory information and verifying the arrival time window; and combining the grasping pose, clamping parameters, timing parameters, and target sorting position into the sorting control parameters; wherein, when the arrival time window does not satisfy the preset conditions, a recalculation or skipping strategy is executed before the sorting control parameters are output.
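Two of the steps in claim 8, the effective contact area and the arrival time window check, can be sketched as follows. The pixel-set representation of masks, the function names, the actuator latency, and the minimum lead time are assumptions, and the "skip" branch stands in for whatever recalculation or skipping strategy the real system applies.

```python
def effective_contact_area(instance_pixels, defect_pixels):
    """Pixels of the apple that are free of defects (set difference)."""
    return set(instance_pixels) - set(defect_pixels)


def plan_actuation(x_now_mm, t_now_s, belt_speed_mm_s, x_station_mm,
                   actuator_latency_s, min_lead_s=0.05):
    """Compute when to command the actuator so it acts as the apple arrives.

    Returns ("fire", fire_time_s) if the command can still be issued at least
    `min_lead_s` ahead of time, otherwise ("skip", None).
    """
    if belt_speed_mm_s <= 0:
        raise ValueError("belt speed must be positive")
    arrival_s = t_now_s + (x_station_mm - x_now_mm) / belt_speed_mm_s
    fire_time_s = arrival_s - actuator_latency_s
    if fire_time_s - t_now_s < min_lead_s:
        return ("skip", None)  # arrival window cannot be met
    return ("fire", fire_time_s)
```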

9. The intelligent apple sorting method based on deep learning visual detection according to claim 1, characterized in that sending the sorting control parameters to the sorting execution mechanism and storing the sorting results and quality feedback data in the database after sorting is completed comprises: sending the sorting control parameters through the communication interface and receiving an execution readiness confirmation; collecting quality feedback data during the sorting process, including the grasping success indicator, clamping status parameters, pose tracking residuals, and placement position confirmation; and, after sorting is completed, writing the sorting results and the quality feedback data to the database according to the unified timestamp, and recording the corresponding summary of the sorting control parameters; wherein, when no execution readiness confirmation is received or a grasping failure is detected, a preset retry or abort strategy is triggered before writing.
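The retry-then-record behaviour of claim 9 can be sketched with Python's built-in `sqlite3` module. The table layout, the `send` callback that stands in for the communication interface, the feedback keys `ready` and `grasp_ok`, and the retry count are all hypothetical.

```python
import json
import sqlite3


def create_log_table(conn):
    """Create the (assumed) table that stores one row per sorted apple."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sorting_log ("
        "unified_ts INTEGER, status TEXT, attempts INTEGER, "
        "control_params TEXT, feedback TEXT)"
    )


def dispatch_and_record(conn, unified_ts, params, send, max_retries=2):
    """Send control parameters, retry on failure, then write the outcome.

    send: callable taking the parameters and returning a feedback dict;
          it replaces the real communication interface in this sketch.
    """
    status, feedback, attempts = "aborted", {}, 0
    while attempts <= max_retries:
        attempts += 1
        feedback = send(params)
        if feedback.get("ready") and feedback.get("grasp_ok"):
            status = "sorted"
            break
    conn.execute(
        "INSERT INTO sorting_log VALUES (?, ?, ?, ?, ?)",
        (unified_ts, status, attempts,
         json.dumps(params, sort_keys=True),
         json.dumps(feedback, sort_keys=True)),
    )
    conn.commit()
    return status
```

Writing the row only after the retry loop ends matches the claim's ordering, in which the retry or abort strategy runs before the database write.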

10. An intelligent apple sorting system based on deep learning visual detection, for implementing the intelligent apple sorting method based on deep learning visual detection according to any one of claims 1-9, characterized in that it comprises: an acquisition unit, which includes at least one set of camera-light source-encoder synchronous triggering devices; a preprocessing unit, configured to hardware-trigger multiple cameras based on encoder pulses, acquire multi-view images of the same apple with a unified timestamp, and preprocess the multi-view images to obtain standardized input data; a parsing unit, configured to input the standardized input data into a trained deep learning model and parse the output of the deep learning model to obtain the instance segmentation results, defect segmentation results, appearance parameter results, and grade classification results corresponding to the apple; a processing unit, configured to estimate the three-dimensional pose of the apple based on the instance segmentation results and key point detection results, and to determine the trajectory information of the apple in combination with the time information of the conveying process; and a control unit, configured to generate sorting control parameters corresponding to the apple based on the grade classification results, defect distribution information, and trajectory information, and further configured to send the sorting control parameters to the sorting execution mechanism and to store the sorting results and quality feedback data in the database after sorting is completed.