Fruit and vegetable near-infrared image feature extraction and state prediction method
By employing a multi-agent sliding window collaborative sampling and multimodal temporal fusion method, the problem of identifying early aging signals in fruit and vegetable detection was solved, enabling accurate prediction and scientific regulation of fruit and vegetable status, and improving the robustness of detection and the stability of prediction.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- TIANJIN ACAD OF AGRI SCI
- Filing Date
- 2026-02-13
- Publication Date
- 2026-06-19
AI Technical Summary
Existing technologies struggle to effectively identify early aging signals such as changes in the ultrastructure of fruit and vegetable skins, and they also struggle to perform deep alignment and adaptive fusion of image sequences and physicochemical indicators within a unified temporal framework, resulting in poor robustness and unstable prediction results in fruit and vegetable detection.
A multi-agent sliding window collaborative sampling mechanism and a multi-modal temporal fusion method are adopted to predict the status of fruits and vegetables by near-infrared image preprocessing, multi-agent sliding window collaborative sampling, timestamp alignment and fusion of image and physicochemical index data, and combined with a multi-task output module.
It improves the sensitivity of identifying early aging signals in fruits and vegetables, enhances the ability to scientifically regulate the state of fruits and vegetables, provides deeper biological characterization, and achieves accuracy in fruit and vegetable quality detection and stability in prediction.
Smart Images

Figure CN122244462A_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the field of fruit and vegetable quality detection, intelligent grading and shelf life prediction technology, and specifically relates to a method for extracting near-infrared image features and predicting the state of fruits and vegetables.
Background Technology
[0002] Near-infrared imaging technology, as a non-contact, non-destructive optical inspection method, has been widely used in the industrial field, mainly for the composition analysis, surface defect identification, and classification of static objects (such as ores, metal parts, textiles, etc.). Its basic principle is that objects have different absorption and reflection characteristics of light in different near-infrared bands. By acquiring and analyzing reflected or transmitted images, indirect assessment of the internal composition or surface condition of an object can be achieved.
[0003] However, directly applying traditional near-infrared imaging and feature extraction methods to fruits and vegetables—a typical post-harvest living biological system—faces significant technical bottlenecks and adaptation challenges, mainly in the following aspects: (1) Fruits and vegetables remain living organisms after harvest, continuously engaging in physiological metabolic activities such as respiration, transpiration, and enzymatic reactions. Their internal components, including water, sugars, organic acids, and fibrous tissue, are not only spatially unevenly distributed but also dynamically change over time, resulting in strong scattering, absorption, and multiple reflections of near-infrared light within the fruit and vegetable tissues. Furthermore, changes in the ultrastructure of the fruit and vegetable epidermis, such as the waxy layer and stomata, are often early signs of quality deterioration. Their optical response is weak and easily drowned out by complex background noise, such as water vapor film generated by respiration, uneven ambient lighting, and instrument thermal noise. Traditional image processing algorithms based on static scenes (such as fixed threshold segmentation and global feature statistics) struggle to effectively separate these weak physiological signals from noise, leading to low signal-to-noise ratios and poor robustness in feature extraction.
[0004] (2) The deterioration process of fruits and vegetables (such as chilling injury, freezing injury, browning, and rotting) usually begins in a local area (such as the stem, navel, or mechanical damage) and gradually spreads over time, exhibiting significant spatial heterogeneity. At the same time, changes in internal physicochemical indicators (such as decreased firmness and increased enzyme activity) often lag behind the appearance of external visual symptoms, and there is a complex nonlinear coupling relationship between the two. Existing general prediction models are mostly based on global image features or data at a single time point, which makes it difficult to effectively model this spatiotemporal heterogeneity of "local initiation and global evolution" and the dynamic relationship between appearance and internal quality, resulting in insensitivity to early deterioration identification and unstable prediction results.
[0005] (3) The assessment of the state of fruits and vegetables requires the integration of multi-source information, including appearance images and key physicochemical indicators (such as hardness, weight, respiration intensity, and enzyme activity). Although some studies have attempted to combine images and sensor data, they have mostly remained at the level of simple feature splicing or post-decision fusion, lacking methods for deep alignment, adaptive fusion, and joint dynamic modeling of image sequences and indicator sequences within a unified temporal framework. This limits the model's ability to capture the complete physiological evolution trajectory of fruits and vegetables and makes it difficult to provide interpretive predictions with clear biological significance.
[0006] Conventional methods such as convolutional neural networks (CNNs) typically use the entire image as input. Operations such as global pooling in their feature extraction process can dilute subtle but crucial deterioration information (such as early lesions or slight wrinkling). For fruit and vegetable inspection tasks that require focused attention on specific areas (such as the area around the stem or suspected defects), these methods lack a mechanism similar to a human expert's "active observation and focusing on suspicious areas," leading to insufficient learning of local features.
Summary of the Invention
[0007] The purpose of this invention is to provide a method for extracting features and predicting the state of near-infrared images of fruits and vegetables, aiming to solve the technical bottleneck of general imaging algorithms in identifying early aging signals such as changes in the ultrastructure of the epidermis.
[0008] To achieve the above objectives, the present invention provides the following technical solution: This invention provides a method for extracting near-infrared image features and predicting the state of fruits and vegetables, characterized by the following steps: (1) collecting near-infrared image sequences of target fruits and vegetables under multiple temperature stress storage conditions and simultaneously collecting physicochemical index data, and establishing a unique identifier and corresponding timestamp for each sample; (2) preprocessing the near-infrared image sequences, the preprocessing including noise estimation and adaptive denoising, and logarithmic domain Wallis contrast enhancement; (3) extracting temporal image features from the preprocessed image sequences based on a multi-agent sliding window collaborative sampling mechanism; (4) aligning the temporal image features with the physicochemical index data with timestamps and performing multimodal fusion to obtain a fused temporal representation; (5) based on the fused temporal representation, simultaneously outputting the fruit and vegetable state classification results, the remaining shelf life regression results, and the risk level classification results through a multi-task output module.
[0009] Preferably, the noise estimation and adaptive denoising in step (2) includes: 2.1) extracting the high-frequency response map $H$ from the normalized image, estimating the high-frequency noise level $\sigma_{hf}$ based on the median absolute deviation, and estimating the background noise $\sigma_{bg}$; 2.2) fusing $\sigma_{hf}$ and $\sigma_{bg}$ to obtain the noise level estimate $\sigma_t$, and adaptively setting the nonlocal means filter parameters based on $\sigma_t$, including the filter strength $h$ and a search window radius $r_s$ and patch radius $r_p$ that increase monotonically with the noise level; 2.3) denoising the image using nonlocal means filtering with the adaptive parameters.
[0010] Preferably, the logarithmic domain Wallis contrast enhancement in step (2) includes: 3.1) performing a logarithmic transformation on the denoised image to obtain a logarithmic domain image $L(x)$; 3.2) calculating the logarithmic domain mean $m(x)$ and standard deviation $s(x)$ within a local window; 3.3) performing the logarithmic domain Wallis transform $L'(x) = g(x)\,(L(x) - m(x)) + b(x)\,m_0 + (1 - b(x))\,m(x)$, where $m_0$ is the target mean, $b(x)$ is the luminance bias coefficient, and $g(x)$ is the local gain factor; 3.4) inversely transforming the enhancement result to the linear domain and applying dynamic range constraints.
[0011] Preferably, the multi-agent sliding window collaborative sampling in step (3) includes: setting N agents on each image frame, each agent acquiring image patches according to its local observation window; each agent updating its internal memory state through an LSTM based on local observations and historical information, and outputting actions according to the policy network to adjust the window position and scale; using a shared convolutional neural network to extract the features of each agent's local observations, and aggregating all agent features through a communication fusion module to form a global image representation $F_t$ for each frame, which in turn constitutes a time-series image feature sequence $\{F_t\}_{t=1}^{T}$.
[0012] Preferably, the multimodal fusion in step (4) includes: normalizing and extracting features from the physicochemical index data to obtain an index feature sequence $\{z_k\}$; aligning the index feature sequence with the image feature sequence by timestamp; and employing a gated fusion mechanism to fuse the image feature $F_t$ with the aligned index feature $\hat{z}_t$ at each time $t$: $g_t = \sigma(W_g [F_t; \hat{z}_t] + b_g)$, $x_t = g_t \odot F_t + (1 - g_t) \odot (U \hat{z}_t)$,
[0013] where $\sigma$ is the Sigmoid function, $\odot$ denotes element-wise multiplication, $W_g$ and $b_g$ are the gate parameters, and $U$ is the dimension mapping matrix; $g_t$ reflects the adaptive weights of the image modality and the index modality at that moment. The fused sequence $\{x_t\}$ is input into a temporal network to obtain a temporal representation. The temporal network can be an LSTM, $h_t = \mathrm{LSTM}(x_t, h_{t-1})$, and the hidden states of the entire sequence are used as the temporal representation: $H = [h_1, \dots, h_T]$, where $h_t$ is the temporal hidden state at time $t$. $H$ is the joint temporal representation of image information and physicochemical index information, which is used for subsequent classification, regression or anomaly detection tasks.
[0014] Preferably, the multi-task output module in step (5) includes: pooling the fused temporal representation to obtain a shared feature vector e; inputting e into the classification head, regression head and grading head respectively; the classification head outputs the probability distribution of fruit and vegetable status categories; the regression head outputs the predicted value of the remaining shelf life days; the grading head outputs the probability distribution of risk level, or obtains the risk level by threshold mapping based on the predicted value output by the regression head.
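[0015] Preferably, the loss function of the multi-task output module is a weighted multi-task loss: $\mathcal{L} = \lambda_1 \mathcal{L}_{cls} + \lambda_2 \mathcal{L}_{reg} + \lambda_3 \mathcal{L}_{risk}$, where $\mathcal{L}_{cls}$ is the cross-entropy loss for state classification, $\mathcal{L}_{reg}$ is the Huber loss for the regression of remaining shelf-life days, $\mathcal{L}_{risk}$ is the cross-entropy loss for risk level classification, and $\lambda_1$, $\lambda_2$, $\lambda_3$ are preset weights.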
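For illustration only, the weighted multi-task loss described above can be sketched in plain Python for a single sample. The function names, the Huber threshold `delta`, and the default weights are assumptions made for this sketch; the specification does not give their values.

```python
import math

def cross_entropy(probs, label):
    """Negative log-likelihood of the true class under a probability vector."""
    return -math.log(max(probs[label], 1e-12))  # floor avoids log(0)

def huber(pred, target, delta=1.0):
    """Huber loss: quadratic for small errors, linear beyond `delta` (assumed threshold)."""
    err = abs(pred - target)
    if err <= delta:
        return 0.5 * err * err
    return delta * (err - 0.5 * delta)

def multitask_loss(state_probs, state_label, days_pred, days_true,
                   risk_probs, risk_label, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of state cross-entropy, remaining-days Huber loss, and risk cross-entropy."""
    w_cls, w_reg, w_risk = weights  # preset weights (assumed values)
    return (w_cls * cross_entropy(state_probs, state_label)
            + w_reg * huber(days_pred, days_true)
            + w_risk * cross_entropy(risk_probs, risk_label))
```

In this sketch, setting one weight to zero switches the corresponding head off, which can be convenient when the risk level is derived from the regression output by threshold mapping instead of being predicted by its own head.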
[0016] Preferably, the physicochemical index data includes at least two of hardness, weight, respiration rate, and enzyme activity.
[0017] Preferably, in step (3), the training of the multi-agent system adopts a centralized training and decentralized decision-making framework, in which the global advantage function $A_t$ is calculated through a centralized value network and is used to update the policy network parameters of each agent.
[0018] Preferably, the temperature stress storage conditions include at least four temperature gradients set on the basis of a reference temperature, with relative humidity kept stable under each gradient, and the number of samples under each temperature condition meeting the requirements for model training, verification and testing.
[0019] The beneficial effects of the present invention are: (1) Accurately analyze the characteristics of organisms and break through the bottleneck of fruit and vegetable detection: In view of the complex biological characteristics of fruits and vegetables, the present invention effectively eliminates the interference caused by the respiration of fruits and vegetables and tissue refraction through logarithmic domain Wallis enhancement and adaptive noise reduction. It can keenly capture subtle physiological signals reflecting early aging, such as changes in the ultrastructure of the fruit and vegetable epidermis, and overcome the problems of "low signal-to-noise ratio and difficulty in feature extraction" in the field of fruits and vegetables in the general industrial method.
[0020] (2) Multi-agent collaborative sampling solves the problem of local deterioration assessment: The multi-agent sliding window collaborative sampling mechanism imitates the scanning and observation habits of human experts on key parts of fruits and vegetables (such as stem, navel and potential disease spots), which can effectively prevent local defect features from being diluted by global features, and greatly improve the sensitivity of identification of complex states such as hidden cold damage and internal browning.
[0021] (3) Multimodal temporal fusion enhances the value of scientific regulation: This invention combines dynamic image features with key physicochemical indicators (hardness, enzyme activity, etc.), which can not only predict the state classification but also model the nonlinear process of postharvest physiological metabolism of fruits and vegetables through deep learning. This modeling method can provide a deeper level of biological characterization, which offers a theoretical basis for regulating postharvest softening and senescence of fruits, and provides technical support for moving fruit and vegetable handling from "postharvest monitoring" to "scientific management".
Attached Figure Description
[0022] Figure 1 is a schematic diagram of the overall process of the method of the present invention; Figure 2 is a schematic diagram of the near-infrared image preprocessing results; Figure 3 is a schematic diagram of the near-infrared image preprocessing effects for melon fruits under different storage temperature conditions; Figure 4 is a schematic diagram of multi-agent sliding window collaborative sampling for a single frame of near-infrared image; Figure 5 is a schematic diagram of the local observation network structure; Figure 6 is a schematic diagram of the network structure for belief and action updates; Figure 7 is a schematic diagram of the A2C decision optimization method.
Detailed Implementation
[0023] The technical solution of the present invention will now be described in detail with reference to the accompanying drawings and embodiments. It should be noted that the following embodiments are only used to explain the present invention and do not constitute a limitation on the scope of protection of the present invention.
[0024] As shown in Figure 1:
(1) Data preparation
1) Sample source and pretreatment
Before collection, fresh samples of the target fruits and vegetables were selected, ensuring they were uniformly mature and free from mechanical damage, disease spots, and rot. To minimize the impact of field heat and surface moisture differences on near-infrared imaging and subsequent index measurements, the samples underwent a field-heat removal treatment before storage. Specifically, the samples were placed in a cool, well-ventilated environment and left to stand for approximately 12 hours to allow the fruit temperature to approach ambient temperature. Subsequently, appearance screening and grouping were performed, with screening criteria including, but not limited to: weight range, appearance color / stripe distribution, surface integrity, fruit size, and the integrity of the fruit stalk / navel. The screened samples were then placed in foam mesh bags or other cushioning packaging to reduce damage caused by friction during handling and storage.
[0025] 2) Temperature stress storage conditions and sample grouping
A baseline temperature T0 is set based on the optimal storage temperature of the sample under normal conditions. Temperature gradients are then set upwards and downwards from T0 to form at least four temperature stress storage conditions (e.g., low-temperature mild stress, low-temperature severe stress, baseline temperature, and slightly higher temperature). The relative humidity and ventilation conditions are kept consistent under each temperature gradient (preferably controlled within a stable range) so that humidity does not become a major confounding factor.
[0026] The number of samples at each temperature condition should meet the training / verification / testing requirements, and samples should be drawn from each temperature condition at each sampling time according to preset rules for the determination of physicochemical indicators. The samples used for image acquisition can be from the same batch or have the same number as the physicochemical indicator samples. It is preferred to use samples with the same number to achieve a strict correspondence between "image - indicator - label".
[0027] 3) Sampling timeline and batch definition (ensuring the "sequence" attribute)
Near-infrared image sequence acquisition is performed over time, including at least the initial time t0 (after entering storage / field-heat removal) and multiple subsequent times t1…tn. The time interval can be a fixed interval (e.g., daily or every 12 hours) or set adaptively according to the storage stage.
[0028] To facilitate subsequent time-series network modeling, a unique identifier ID is established for each sample or batch, and metadata such as temperature gradient number, storage time, sampling time, imaging sequence number, physicochemical index measurement timestamp, operator and equipment number are recorded, thus forming a data structure that is "alignable by timestamp".
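As a non-limiting illustration, the "alignable by timestamp" data structure could be held in a record such as the one below. The class name, field names and the two helper methods are hypothetical and chosen only for this sketch; the specification lists the kinds of metadata but not a concrete schema.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, Optional

@dataclass
class SampleRecord:
    """One imaging event for one sample, with the metadata needed for alignment."""
    sample_id: str                 # unique identifier of the sample or batch
    temperature_group: int         # temperature gradient number
    storage_start: datetime        # time the sample entered storage
    image_time: datetime           # timestamp of the near-infrared frame
    frame_index: int               # imaging sequence number
    index_time: Optional[datetime] = None        # timestamp of the index measurement
    indices: Dict[str, float] = field(default_factory=dict)  # e.g. hardness, weight
    operator: str = ""
    device_id: str = ""

    def storage_hours(self) -> float:
        """Storage time elapsed when the frame was taken, in hours."""
        return (self.image_time - self.storage_start).total_seconds() / 3600.0

    def index_lag_hours(self) -> Optional[float]:
        """Time difference between the index measurement and the image, in hours."""
        if self.index_time is None:
            return None
        return (self.index_time - self.image_time).total_seconds() / 3600.0
```

Keeping both timestamps on the same record makes it possible to check that the time difference between imaging and index measurement stays within a controllable range.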
[0029] 4) Near-infrared image acquisition system and imaging specifications
Near-infrared image acquisition employed a near-infrared camera imaging system, with the spectral range preferably covering the near-infrared region relevant to the moisture / sugar / tissue structure of fruit and vegetable samples. To ensure comparability within the same sequence, the imaging environment was kept as constant as possible, including: fixing the type and arrangement of the light source (near-infrared LED, halogen lamp, etc.) and recording the light source power, angle of incidence and distance; fixing the camera exposure, gain, lens focal length, working distance, and field of view; using a light-shielding box or darkroom structure to reduce ambient light interference; using a uniform background plate (low reflection, stable near-infrared response) and a fixed support platform position; and performing black-and-white reference correction or dark-field / flat-field correction for each image (for example, using a white reference board for reflectance and a capped lens for the dark field) to reduce device drift and illumination non-uniformity.
[0030] During data collection, one or more perspectives (e.g., front, side, rotation) can be captured for each sample, and the sample posture should be standardized (based on the direction of the fruit stalk or the fruit navel) to reduce feature differences caused by posture variations. If the same sample is imaged from multiple perspectives, attention should be paid to ensuring that the perspective numbers are consistent.
[0031] 5) Methods for collecting physicochemical indicators and their correspondence
At each sampling time, the corresponding physicochemical data should be collected, including hardness, weight, respiration rate, and enzyme activity. It is recommended that each indicator be collected using standard instruments and a unified procedure, ensuring that the data is collected "at the same time as the image or with a controllable time difference."
[0032] To ensure the alignment of "image-physicochemical indicators", it is preferable to perform non-destructive or micro-destructive measurements such as weight and hardness on the same sample immediately after near-infrared imaging is completed. For destructive indicators such as enzyme activity, a pairing rule is adopted to set "imaging samples" and "physicochemical test samples" within the same batch, and the pairing relationship is recorded in the metadata.
[0033] 6) Label / status definition and dataset organization
Based on the appearance symptoms, changes in the cut-surface tissue, thresholds of physicochemical indicators, or expert judgment, a status label is assigned to the sample at each time point. At the same time, the end-of-shelf-life criterion (e.g., the moment when the weighted score of physicochemical indicators reaches a threshold) is recorded and used to generate the regression labels for the remaining shelf-life days. The risk level can be obtained by threshold mapping of the remaining shelf life, and the version number of the mapping rule is saved in the data collection table to ensure traceability.
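By way of example, the regression label and the threshold-mapped risk level could be generated as follows. The thresholds (3 and 7 days), the level names and the version tag are hypothetical values chosen for this sketch; the specification only requires that a versioned mapping rule exists.

```python
import bisect

RULE_VERSION = "risk-map-v1"      # hypothetical version tag stored with the labels
THRESHOLDS_DAYS = [3.0, 7.0]      # hypothetical cut points, ascending
LEVELS = ["high", "medium", "low"]

def remaining_days(current_day, end_of_shelf_life_day):
    """Regression label: days left until the end-of-shelf-life criterion is met."""
    return max(0.0, end_of_shelf_life_day - current_day)

def risk_level(days_left):
    """Map remaining days to a risk level; fewer days left means higher risk."""
    return LEVELS[bisect.bisect_right(THRESHOLDS_DAYS, days_left)]

def make_labels(current_day, end_day):
    """Bundle the labels with the rule version so the mapping stays traceable."""
    days = remaining_days(current_day, end_day)
    return {"remaining_days": days, "risk": risk_level(days), "rule_version": RULE_VERSION}
```

For instance, under these assumed thresholds a sample observed on day 2 with shelf life ending on day 12 has 10 days left and is labeled low risk.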
[0034] (2) Data processing
1) Noise estimation
S1-1. Input and normalization
Acquire the raw near-infrared image frame $I_t^{raw}$ (t is the frame number), whose pixel gray levels have b bits, and first normalize it to obtain $I_t$: $I_t(x) = I_t^{raw}(x) / (2^b - 1)$, where $x$ indicates the pixel position.
[0035] S1-2. High-frequency response extraction
The normalized image is subjected to a second-order difference to extract the high-frequency noise response, resulting in the high-frequency response map $H$: $H = I_t * K_L$, where $*$ denotes convolution and $K_L$ represents the Laplace operator.
[0036] S1-3. Noise estimation based on median absolute deviation
Robust statistical estimation is applied to $H$ to estimate its standard deviation, thus obtaining the high-frequency noise estimate $\sigma_{hf}$: $\sigma_{hf} = 1.4826 \cdot \mathrm{median}(|H - \mathrm{median}(H)|)$. When $H$ is approximately zero-mean, a simplified form can be used: $\sigma_{hf} \approx 1.4826 \cdot \mathrm{median}(|H|)$.
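For illustration only, steps S1-1 to S1-3 can be sketched in plain Python on an image stored as a list of rows. The 4-neighbour Laplacian kernel, the restriction to interior pixels and the function names are assumptions of this sketch; the specification only states that a Laplace operator and the median absolute deviation are used.

```python
import statistics

# 4-neighbour discrete Laplacian kernel (an assumed choice of Laplace operator).
LAPLACIAN = [[0, 1, 0],
             [1, -4, 1],
             [0, 1, 0]]

def normalize(raw, bits):
    """Scale b-bit integer gray levels to the range [0, 1]."""
    scale = float(2 ** bits - 1)
    return [[v / scale for v in row] for row in raw]

def laplacian_response(img):
    """High-frequency response map H, evaluated on interior pixels only."""
    height, width = len(img), len(img[0])
    out = []
    for y in range(1, height - 1):
        row = []
        for x in range(1, width - 1):
            acc = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    acc += LAPLACIAN[dy + 1][dx + 1] * img[y + dy][x + dx]
            row.append(acc)
        out.append(row)
    return out

def mad_sigma(response):
    """Robust standard deviation of H: 1.4826 * median(|H - median(H)|)."""
    flat = [v for row in response for v in row]
    med = statistics.median(flat)
    return 1.4826 * statistics.median(abs(v - med) for v in flat)
```

Because a linear intensity ramp has zero second-order difference, a smoothly shaded but noise-free image yields an estimate close to zero in this sketch, which is the behaviour the high-frequency response is meant to provide.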
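[0037] S1-4. Background area statistics
Select the background area $\Omega_b$ in the image (for example, low-texture, low-gradient regions, or preset ROIs), calculate the background mean $\mu_b$, and estimate the background noise $\sigma_{bg}$: $\mu_b = \frac{1}{|\Omega_b|} \sum_{x \in \Omega_b} I_t(x)$, $\sigma_{bg} = \sqrt{\frac{1}{|\Omega_b|} \sum_{x \in \Omega_b} (I_t(x) - \mu_b)^2}$.
[0038] S1-5. Noise level fusion and sequence smoothing
The high-frequency noise estimate $\sigma_{hf}$ and the background noise estimate $\sigma_{bg}$ are fused to obtain the noise level $\sigma_t$ of the current frame. A weighted average or a conservative approach can be used: $\sigma_t = \alpha\,\sigma_{hf} + (1 - \alpha)\,\sigma_{bg}$ with $\alpha \in [0, 1]$, or $\sigma_t = \max(\sigma_{hf}, \sigma_{bg})$.
[0039] To improve sequence consistency, time-series smoothing is performed on the noise estimate: $\bar{\sigma}_t = \beta\,\bar{\sigma}_{t-1} + (1 - \beta)\,\sigma_t$, where $\beta \in [0, 1)$ is the smoothing coefficient.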
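A minimal sketch of steps S1-4 and S1-5 follows, assuming that the "conservative approach" means taking the larger of the two estimates and that the time-series smoothing is a first-order exponential filter. Both are assumptions of this sketch, as are the parameter names `alpha` and `beta` and their default values.

```python
import statistics

def background_sigma(img, region):
    """Standard deviation of pixel values over a background region of (row, col) pairs."""
    return statistics.pstdev([img[r][c] for r, c in region])

def fuse_noise(sigma_hf, sigma_bg, alpha=0.5, conservative=False):
    """Fuse the two estimates by weighted average, or conservatively by the maximum."""
    if conservative:
        return max(sigma_hf, sigma_bg)
    return alpha * sigma_hf + (1.0 - alpha) * sigma_bg

def smooth_sequence(sigmas, beta=0.8):
    """Exponential smoothing over frames: s_t = beta * s_(t-1) + (1 - beta) * sigma_t."""
    smoothed = []
    prev = None
    for s in sigmas:
        prev = s if prev is None else beta * prev + (1.0 - beta) * s
        smoothed.append(prev)
    return smoothed
```

A larger `beta` makes the per-frame noise level, and therefore the denoising strength derived from it, change more slowly across the sequence.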
[0040] 2) Nonlocal means denoising
S2-1. Adaptive parameter setting
Based on the smoothed noise level $\bar{\sigma}_t$, adaptively determine the NLM filter strength parameter $h$: $h = \kappa\,\bar{\sigma}_t$, where $\kappa$ is a scaling factor. Furthermore, to balance computational efficiency and noise reduction effectiveness, the search window radius $r_s$ and the patch radius $r_p$ are set to increase monotonically with the noise level: $r_s = \phi_s(\bar{\sigma}_t)$, $r_p = \phi_p(\bar{\sigma}_t)$, satisfying the monotonicity constraint $\sigma_a \le \sigma_b \Rightarrow \phi(\sigma_a) \le \phi(\sigma_b)$, where $\phi_s$ and $\phi_p$ can be piecewise constant functions or continuous functions, and their values are limited to positive integers (pixel radius).
[0041] S2-2. Patch similarity measurement
For each pixel $x$, traverse the candidate pixels $y$ within its search window $S(x)$ and compute the weighted squared difference distance over the patch offset set $P$ (the neighborhood centered at $x$ or $y$): $d(x, y) = \sum_{k \in P} G(k)\,\big(I_t(x + k) - I_t(y + k)\big)^2$, where $G(k)$ is the weight kernel within the patch, used to emphasize the center of the patch or suppress edge effects.
[0042] S2-3. Weight calculation and normalization
The weights are calculated based on the distance $d(x, y)$ and then normalized: $w(x, y) = \frac{1}{Z(x)} \exp\!\big(-d(x, y) / h^2\big)$, $Z(x) = \sum_{y \in S(x)} \exp\!\big(-d(x, y) / h^2\big)$, where $Z(x)$ is the normalization factor that ensures the weights sum to 1.
[0043] S2-4. NLM reconstruction output
The denoised image $\hat{I}_t$ is obtained by weighted summation of the pixels within the search window: $\hat{I}_t(x) = \sum_{y \in S(x)} w(x, y)\,I_t(y)$. The denoising results $\{\hat{I}_t\}$ for each frame are thus obtained.
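For illustration only, steps S2-1 to S2-4 can be sketched as a direct (unoptimized) nonlocal means filter in plain Python. The noise thresholds in `nlm_params`, the scaling factor `k`, the uniform patch kernel and the border clamping are assumptions of this sketch and are not specified in the text.

```python
import math

def nlm_params(sigma, k=0.8):
    """Map a noise level to (h, search_radius, patch_radius); thresholds are illustrative."""
    h = max(k * sigma, 1e-6)  # filter strength proportional to the noise level
    if sigma < 0.02:
        return h, 2, 1
    if sigma < 0.05:
        return h, 3, 1
    return h, 4, 2            # radii never decrease as sigma grows

def nlm_denoise(img, h, search_radius, patch_radius):
    """Nonlocal means with a uniform patch kernel G and clamped borders."""
    height, width = len(img), len(img[0])

    def px(y, x):
        # Clamp coordinates so patches near the border reuse edge pixels.
        return img[min(max(y, 0), height - 1)][min(max(x, 0), width - 1)]

    offsets = [(dy, dx)
               for dy in range(-patch_radius, patch_radius + 1)
               for dx in range(-patch_radius, patch_radius + 1)]
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            num = den = 0.0
            for cy in range(max(0, y - search_radius), min(height, y + search_radius + 1)):
                for cx in range(max(0, x - search_radius), min(width, x + search_radius + 1)):
                    # Mean squared patch difference d(x, y).
                    d = sum((px(y + dy, x + dx) - px(cy + dy, cx + dx)) ** 2
                            for dy, dx in offsets) / len(offsets)
                    w = math.exp(-d / (h * h))
                    num += w * img[cy][cx]
                    den += w          # den accumulates Z(x)
            out[y][x] = num / den
    return out
```

This direct form costs on the order of (search window area) x (patch area) operations per pixel, which is why the text ties the radii to the noise level instead of always using large windows.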
[0044] 3) Wallis contrast enhancement in the logarithmic domain
S3-1. Logarithmic domain mapping
To reduce the impact of multiplicative illumination variations and improve visibility in dark areas, a logarithmic transformation is performed on the denoised image $\hat{I}_t$: $L_t(x) = \log(\hat{I}_t(x) + \varepsilon)$, where $\varepsilon$ is a very small positive number that avoids the singularity of the logarithm.
[0045] S3-2. Calculation of local statistics
Calculate the local mean $m(x)$ and local standard deviation $s(x)$ of the logarithmic domain image within a local window $W(x)$ centered at $x$: $m(x) = \frac{1}{|W(x)|} \sum_{y \in W(x)} L_t(y)$, $s(x) = \sqrt{\frac{1}{|W(x)|} \sum_{y \in W(x)} (L_t(y) - m(x))^2}$.
[0046] S3-3. Wallis enhancement (logarithmic domain)
By applying the Wallis transform in the logarithmic domain, the local mean and local contrast are adjusted to the target level, resulting in an enhanced logarithmic domain image $L'_t$: $L'_t(x) = g(x)\,(L_t(x) - m(x)) + b(x)\,m_0 + (1 - b(x))\,m(x)$, where $m_0$ is the target mean, $b(x)$ is the luminance bias coefficient, and $g(x)$ is the local gain factor.
[0047] S3-4. Basic gain setting
Based on the target standard deviation $s_0$, set the gain relative to the local standard deviation $s(x)$: $g(x) = s_0 / (s(x) + c)$, where $c$ is a stabilizing term that prevents an excessively small denominator. To avoid abrupt changes in overall contrast, a mixing coefficient can be introduced.
[0048] S3-5. Amplitude limiting constraint (suppressing over-enhancement)
By applying an upper limit to the gain, the clipped gain $g_c(x)$ is obtained: $g_c(x) = \min(g(x), g_{max})$, where $g_{max}$ is the preset maximum gain.
[0049] S3-6. Noise suppression constraint
Construct an attenuation factor $\rho(x)$ related to the noise level, so that the gain decreases when the noise level is higher or the local contrast is lower. The final gain is obtained: $g_f(x) = \rho(x)\,g_c(x)$, and $g(x)$ in S3-3 is replaced with $g_f(x)$ to achieve the enhancement.
[0050] S3-7. Inverse transformation and output constraints
The enhancement result $\tilde{I}_t$ is obtained by mapping from the logarithmic domain back to the linear domain: $\tilde{I}_t(x) = \exp(L'_t(x)) - \varepsilon$. The final preprocessed result $I_t^{pre}$ is obtained by applying dynamic range constraints to the output (within the normalization range): $I_t^{pre}(x) = \min(\max(\tilde{I}_t(x), 0), 1)$.
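A minimal sketch of steps S3-1 to S3-7 in plain Python is given below for illustration. The default target mean and standard deviation, the constant bias coefficient, the stabilizer `stab`, and the use of a single scalar `atten` in place of a spatially varying attenuation factor are all assumptions of this sketch; the mixing coefficient of S3-4 is omitted.

```python
import math

def local_mean_std(img, y, x, radius):
    """Mean and standard deviation over the in-bounds part of a square window."""
    vals = [img[j][i]
            for j in range(max(0, y - radius), min(len(img), y + radius + 1))
            for i in range(max(0, x - radius), min(len(img[0]), x + radius + 1))]
    mean = sum(vals) / len(vals)
    var = sum((v - mean) ** 2 for v in vals) / len(vals)
    return mean, math.sqrt(var)

def wallis_log(img, radius=1, target_mean=math.log(0.5), target_std=0.3,
               bias=0.5, eps=1e-3, stab=1e-2, g_max=4.0, atten=1.0):
    """Log-domain Wallis enhancement of an image with values in [0, 1]."""
    log_img = [[math.log(v + eps) for v in row] for row in img]
    out = []
    for y, row in enumerate(log_img):
        out_row = []
        for x, value in enumerate(row):
            m, s = local_mean_std(log_img, y, x, radius)
            gain = min(target_std / (s + stab), g_max) * atten  # clip, then attenuate
            enhanced = gain * (value - m) + bias * target_mean + (1.0 - bias) * m
            linear = math.exp(enhanced) - eps                   # back to the linear domain
            out_row.append(min(max(linear, 0.0), 1.0))          # dynamic range constraint
        out.append(out_row)
    return out
```

In this sketch a perfectly flat region has zero deviation from its local mean, so the gain has no effect there and only the bias term shifts its brightness; the clipping by `g_max` matters mainly in low-contrast regions, where the raw gain would otherwise amplify residual noise.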
[0051] (3) Model building
For the near-infrared image sequence $\{I_t^{pre}\}$ obtained after the preprocessing of step (2), multiple sampling agents are deployed in the image plane of each frame. Local observations are formed through sliding windows and collaborative sampling is carried out. The local observation features of each agent are extracted using a convolutional neural network, and information sharing and feature fusion are achieved through a communication fusion module to obtain the global image representation corresponding to each frame, thereby forming an image feature sequence that changes over time.
[0052] 1) Setting up the sampling agents and generating the local observation windows
On each preprocessed frame $I_t^{pre}$, N sampling agents are set up. The local observation window parameters of the i-th agent at time t are defined as $w_t^i = (p_t^i, s_t^i)$, where $p_t^i$ indicates the coordinates of the center or top-left corner of the window and $s_t^i$ represents the window scale (e.g., side length or scale factor). The window region $\Omega_t^i$ is then generated and the local observation is obtained: $o_t^i = \mathrm{Crop}(I_t^{pre}, \Omega_t^i)$, where $\mathrm{Crop}$ indicates cropping $I_t^{pre}$ by $\Omega_t^i$, optionally followed by normalization / resampling to a fixed resolution.
[0053] Each sampling agent moves within the image plane in a sliding window manner and adjusts its scale to achieve collaborative sampling. To ensure sampling effectiveness, boundary constraints can be applied: $\Omega_t^i \subseteq D$, where $D$ is the image domain.
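For illustration, the window generation and the boundary constraint can be sketched as follows, assuming a square window described by its top-left corner and side length. The function names and the minimum side length `s_min` are hypothetical, and the sketch assumes the image is at least `s_min` pixels in each dimension.

```python
def clamp_window(x, y, s, width, height, s_min=2):
    """Force a square window (top-left x, y and side s) to lie inside the image domain."""
    s = max(s_min, min(s, width, height))   # the side cannot exceed the image
    x = min(max(x, 0), width - s)           # shift back inside horizontally
    y = min(max(y, 0), height - s)          # shift back inside vertically
    return x, y, s

def crop(img, x, y, s):
    """Local observation: the s-by-s patch whose top-left corner is (x, y)."""
    return [row[x:x + s] for row in img[y:y + s]]
```

Clamping after every move or scale change guarantees that each agent always observes a full-size patch, so the shared feature extractor never receives a truncated window.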
[0054] 2) Cooperative sampling and state recognition
The multi-agent cooperative sampling process is modeled as a partially observable Markov decision process with the elements $\langle \mathcal{S}, \{\mathcal{A}_i\}, P, \{\mathcal{O}_i\}, R, \gamma \rangle$, where: $\mathcal{S}$ is the global hidden state space (representing the latent target / texture / structure information in the frame and the joint window state of the agents); $\mathcal{A}_i$ represents the action space of the i-th agent; $P(s' \mid s, a)$ is the state transition probability, with $a = (a^1, \dots, a^N)$ the joint action; $\mathcal{O}_i$ is the observation space of the i-th agent; $R$ is the global reward function; and $\gamma$ is the discount factor.
[0055] Each agent obtains a local observation $o_t^i$ at time t and, based on its own historical information, forms an internal memory state (belief state): $h_t^i = \mathrm{LSTM}(\phi(o_t^i), h_{t-1}^i)$, where $\phi$ is the observation encoding function and $h_t^i$ characterizes the temporal information under partially observable conditions; both belief updates and action updates are implemented using LSTM.
[0056] 3) Action definition, policy output, and decentralized decision-making
The action $a_t^i$ of the i-th agent includes one or more of the following: position movement (updating the window position by a discrete step or a continuous offset); scale adjustment (zooming in / out); window selection (selecting a window index from a candidate window set).
[0057] The update of the window parameters by the action can be represented as $w_{t+1}^i = \mathcal{T}(w_t^i, a_t^i)$, where $w_t^i$ denotes the window parameters of the i-th agent and $\mathcal{T}$ is the window state update function.
[0058] During the decentralized execution phase, each agent makes decisions based on its own information and the shared information fused through communication. Its policy is $a_t^i \sim \pi_{\theta}(\cdot \mid h_t^i, c_t^i)$, where $c_t^i$ is the communication information or fused message from the other agents and $\theta$ denotes the policy network parameters.
[0059] 4) Local observation feature extraction and communication fusion
For each local observation $o_t^i$, a convolutional neural network is used to extract a local feature vector: $f_t^i = \mathrm{CNN}_{\psi}(o_t^i)$, where $\mathrm{CNN}_{\psi}$ is a shared or partially shared convolutional feature extraction network and $\psi$ denotes its parameters.
[0060] To enable information sharing among the agents, a communication fusion module is set up. The features of all agents are fused to obtain the shared message or global aggregated feature for each agent. Using the global aggregated message: $c_t = \mathrm{Comm}(f_t^1, \dots, f_t^N)$.
[0061] Furthermore, the fused features are used to construct the global image representation of the frame: $F_t = \mathrm{Agg}(f_t^1, \dots, f_t^N, c_t)$, where $\mathrm{Agg}$ can be implemented in various forms such as fully connected mapping after concatenation, weighted summation, attention fusion, or graph network aggregation; $F_t$ represents the global representation of the near-infrared image frame at time t. This yields a time-varying sequence of image features: $\{F_t\}_{t=1}^{T}$.
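Two of the aggregation options named above (averaging and attention fusion) can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, and the dot-product scoring against a `query` vector is one assumed form of attention, not necessarily the one used in the invention.

```python
import math

def mean_message(features):
    """Global aggregated message: element-wise mean of all agent feature vectors."""
    n = len(features)
    return [sum(col) / n for col in zip(*features)]

def attention_pool(features, query):
    """Attention fusion: softmax over dot-product scores, then a weighted sum."""
    scores = [sum(f_k * q_k for f_k, q_k in zip(f, query)) for f in features]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]   # subtract the max for stability
    total = sum(exps)
    weights = [e / total for e in exps]
    pooled = [sum(w * f[k] for w, f in zip(weights, features))
              for k in range(len(query))]
    return pooled, weights
```

With attention pooling, an agent whose window covers a suspicious region can contribute more to the frame representation than agents looking at healthy tissue, whereas plain averaging treats all windows equally.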
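[0062] 5) A multi-agent learning framework with centralized training and decentralized decision-making
A multi-agent framework of centralized training and decentralized decision-making (CTDE) is adopted to learn the policy. During the training phase, a centralized value network (critic) is used to evaluate the value under the joint state / joint information to stabilize multi-agent collaborative learning; during the execution phase, each agent outputs an action independently.
[0063] Optimization method: define a centralized value function $V_{\omega}(s_t)$, where $\omega$ denotes the parameters of the value network. The advantage function can be expressed as $A_t = r_t + \gamma V_{\omega}(s_{t+1}) - V_{\omega}(s_t)$. The policy network parameters $\theta$ are updated along the gradient direction $\nabla_{\theta} J = \mathbb{E}\big[\sum_{i=1}^{N} \nabla_{\theta} \log \pi_{\theta}(a_t^i \mid h_t^i, c_t^i)\, A_t\big]$, where $\pi_{\theta}$ is the policy of each agent conditioned on its memory state $h_t^i$ and the communication message $c_t^i$. The value network parameters are updated according to the regression loss $\mathcal{L}(\omega) = \big(r_t + \gamma V_{\omega}(s_{t+1}) - V_{\omega}(s_t)\big)^2$, where $r_t$ is the global reward. Depending on the task objectives, the reward can be set as a weighted combination of terms related to improving recognition accuracy, increasing sampling coverage, reducing window redundancy, and smoothing action constraints: $r_t = \lambda_{acc}\,r_t^{acc} + \lambda_{cov}\,r_t^{cov} - \lambda_{red}\,r_t^{red} - \lambda_{sm}\,r_t^{sm}$, where each $\lambda$ is a preset weight.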
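The scalar quantities involved in the optimization step can be sketched as below, assuming a one-step temporal-difference target as in a standard advantage actor-critic (A2C) update. The function names, the sign convention of the reward terms and the default weights are assumptions of this sketch; gradient computation is left to whichever training framework is used.

```python
def td_advantage(reward, value, next_value, gamma=0.99, done=False):
    """One-step advantage A = r + gamma * V(s') - V(s), plus the value target."""
    target = reward + (0.0 if done else gamma * next_value)
    return target - value, target

def a2c_losses(log_probs, advantage, value, target):
    """Policy loss shares one global advantage across agents; value loss is squared error."""
    policy_loss = -sum(log_probs) * advantage   # log_probs: one entry per agent
    value_loss = (target - value) ** 2
    return policy_loss, value_loss

def global_reward(acc_gain, coverage, redundancy, action_jump,
                  weights=(1.0, 0.5, 0.5, 0.1)):
    """Weighted reward: reward accuracy and coverage, penalize overlap and jerky moves."""
    w_acc, w_cov, w_red, w_sm = weights         # preset weights (assumed values)
    return w_acc * acc_gain + w_cov * coverage - w_red * redundancy - w_sm * action_jump
```

Sharing a single advantage across all agents reflects the centralized critic: every agent is credited or penalized according to how the joint sampling performed, while each still acts on its own observations at execution time.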
[0064] Through the above training, N sampling agents are enabled to perform cooperative window selection and movement within the same image frame. Furthermore, the memory capabilities of LSTM are used temporally to update beliefs and actions, ultimately outputting a stable and discriminative sequence of image features. This provides input representations for subsequent identification, tracking, or status assessment.
[0065] (4) Data fusion Feature extraction is performed on the physicochemical index data acquired synchronously with the near-infrared image sequence to obtain the index feature sequence; the index feature sequence is aligned and fused with the image feature sequence obtained in step (3) according to the timestamp, and the fused multimodal sequence is input into the time series network to obtain the time series representation for subsequent state recognition, trend prediction or risk assessment.
[0066] 1) Representation of physicochemical index data and extraction of index features. Let the original vector of physicochemical indicators collected at time t be $x_t = [x_{t,1}, x_{t,2}, \dots, x_{t,M}]^{\top} \in \mathbb{R}^{M}$, where M is the dimension of the physicochemical index vector. Each indicator is then normalized: $\tilde{x}_{t,j} = (x_{t,j} - \mu_j) / \sigma_j$, where $\mu_j$ and $\sigma_j$ are the mean and standard deviation of the j-th indicator.
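The per-indicator normalization described above can be sketched as follows. This is an illustrative example only: the function name `zscore_columns` is hypothetical, the statistics are computed over the supplied rows (population standard deviation), and replacing a zero standard deviation with 1.0 is an added safeguard that the text does not specify.

```python
from statistics import mean, pstdev


def zscore_columns(rows):
    """Normalize each indicator (column) to zero mean and unit standard deviation."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # Guard against constant indicators: a zero deviation is replaced by 1.0.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


# Two time points, two indicators (illustrative values only).
print(zscore_columns([[1.0, 10.0], [3.0, 10.0]]))
```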
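[0067] Furthermore, the normalized vector $\tilde{x}_t$ is mapped to a low-dimensional index feature by the indicator feature extraction network: $p_t = \mathrm{MLP}_{\psi}(\tilde{x}_t)$, where $\mathrm{MLP}_{\psi}$ is a multilayer perceptron (MLP) with parameters $\psi$, and $p_t$ is the index feature vector at time t.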
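[0068] 2) Timestamp alignment. Let the image feature sequence be $\{(g_k, \tau_k)\}_{k=1}^{K}$, where $\tau_k$ is the timestamp of the k-th image frame, and let the physicochemical index feature sequence be $\{(p_j, s_j)\}_{j=1}^{J}$, where $s_j$ is the timestamp of the j-th measurement. When the sampling frequencies of the two differ, alignment mapping is performed based on the timestamps to obtain the index vector corresponding to each image frame. For example, using nearest neighbor alignment: $j^{*}(k) = \arg\min_{j} |\tau_k - s_j|$, $\hat{p}_k = p_{j^{*}(k)}$.
[0069] This yields the index features aligned to the image frames: $\{\hat{p}_k\}_{k=1}^{K}$.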
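A minimal sketch of nearest neighbor timestamp alignment is given below. It assumes that the measurement timestamps are sorted in ascending order and that ties are resolved in favor of the earlier measurement; both assumptions, as well as the function names, are illustrative and not taken from the specification.

```python
from bisect import bisect_left


def nearest_index(sorted_times, t):
    """Return the position of the timestamp in sorted_times closest to t."""
    i = bisect_left(sorted_times, t)
    if i == 0:
        return 0
    if i == len(sorted_times):
        return len(sorted_times) - 1
    before, after = sorted_times[i - 1], sorted_times[i]
    # Ties go to the earlier measurement (an arbitrary choice for this sketch).
    return i - 1 if t - before <= after - t else i


def align_to_frames(frame_times, index_times, index_features):
    """For every image frame, pick the index feature with the nearest timestamp."""
    return [index_features[nearest_index(index_times, t)] for t in frame_times]


# Measurements every 24 h, image frames at irregular times (hours).
aligned = align_to_frames([0, 10, 13, 40, 60], [0, 24, 48], ["a", "b", "c"])
print(aligned)
```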
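[0070] 3) Multimodal fusion. The image feature $g_t$ from step (3) and the aligned indicator feature $\hat{p}_t$ are fused at each time t to obtain the fused feature $z_t$. Fusion method: $\alpha_t = \sigma\left(W_g [g_t; U\hat{p}_t] + b_g\right)$, $z_t = \alpha_t \odot g_t + (1 - \alpha_t) \odot U\hat{p}_t$, where $\sigma$ is the Sigmoid function, $\odot$ denotes element-wise multiplication, U is the dimension mapping matrix, $W_g$ and $b_g$ are the gate parameters, and $\alpha_t$ reflects the adaptive weights of the image modality and the index modality at that moment.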
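The gated fusion step can be illustrated with the short sketch below. It assumes one common form of gating, in which a Sigmoid gate is computed from the concatenation of the image feature and the projected index feature and then used to form a convex combination of the two. The exact gate input, the parameter names `W_g` and `b_g`, and the helper functions are assumptions for illustration; only the Sigmoid, the element-wise product, and the mapping matrix U are named in the text.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def gated_fusion(image_feat, index_feat, U, W_g, b_g):
    """Fuse image and index features with an element-wise Sigmoid gate."""
    projected = matvec(U, index_feat)      # map index feature to the image dimension
    gate_input = image_feat + projected    # list concatenation [g_t; U p_t]
    gate = [sigmoid(s + b) for s, b in zip(matvec(W_g, gate_input), b_g)]
    fused = [a * g + (1.0 - a) * p
             for a, g, p in zip(gate, image_feat, projected)]
    return fused, gate


# With zero gate parameters the gate is 0.5, so the result is a plain average.
fused, gate = gated_fusion([1.0, 2.0], [3.0],
                           U=[[1.0], [1.0]],
                           W_g=[[0.0] * 4, [0.0] * 4],
                           b_g=[0.0, 0.0])
print(fused, gate)
```

A gate value close to 1 makes the fused feature follow the image modality, and a value close to 0 makes it follow the physicochemical index modality.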
[0071] 4) Temporal network modeling and temporal representation output. The fused sequence $\{z_t\}_{t=1}^{T}$ is input to a temporal network to obtain a temporal representation. The temporal network can be an LSTM, $h_t = \mathrm{LSTM}(z_t, h_{t-1})$, and the hidden states of the entire sequence are used as the temporal representation: $H = [h_1, h_2, \dots, h_T]$, where $h_t$ is the temporal hidden state at time t. H is the joint temporal representation of image information and physicochemical index information, and is used for subsequent classification, regression, or anomaly detection tasks.
[0072] Through the above steps, the physical and chemical index data are characterized and strictly aligned and fused with the image feature sequence in the time dimension to obtain a multimodal temporal representation that combines spatial appearance information and physical and chemical state information, thereby improving the accuracy and robustness of the recognition of the target state evolution over time.
[0073] (5) Result characterization Based on the multimodal time series representation obtained in step (4), a multi-task output module is constructed to simultaneously output the melon status classification results, the regression results of the remaining days of shelf life, and the discrete risk level classification results, thereby realizing the joint assessment of the melon quality status and preservation risk.
[0074] 1) Input and shared representation. Let H be the hidden states of the entire sequence output by step (4). The shared vector is obtained by pooling: $e = \mathrm{Pool}(H)$, where Pool(·) can be average pooling, attention pooling, or the end-time representation. The vector e is used as the shared feature input to the multi-task prediction heads.
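Two of the pooling options named above, average pooling and the end-time representation, can be sketched as follows. The function names are hypothetical, and the hidden states are represented as plain lists for illustration.

```python
def mean_pool(hidden_states):
    """Average the hidden states over time, dimension by dimension."""
    n = len(hidden_states)
    return [sum(column) / n for column in zip(*hidden_states)]


def last_state(hidden_states):
    """Use the hidden state at the final time step as the shared vector."""
    return list(hidden_states[-1])


H = [[1.0, 2.0], [3.0, 6.0]]  # two time steps, two hidden units
print(mean_pool(H), last_state(H))
```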
[0075] 2) Fruit and vegetable status classification output. Let the set of fruit and vegetable status categories be $\mathcal{C} = \{c_1, \dots, c_K\}$ (for example, "fresh / saleable / critical / spoiled"). The probabilities of each category are output through a fully connected layer and Softmax: $\hat{y} = \mathrm{Softmax}(W_c e + b_c)$.
[0076] The predicted category is: $\hat{c} = \arg\max_{k} \hat{y}_k$.
[0077] The corresponding training loss (cross-entropy) is: $L_{cls} = -\sum_{k=1}^{K} y_k \log \hat{y}_k$, where $y$ is the one-hot annotation of the true category.
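The classification head's Softmax output, predicted category, and cross-entropy loss can be illustrated as follows. This sketch starts from the logits of the fully connected layer; the logit values are made up for illustration, and the function names are hypothetical.

```python
import math

CATEGORIES = ["fresh", "saleable", "critical", "spoiled"]


def softmax(logits):
    """Numerically stable Softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, true_index):
    """Cross-entropy against a one-hot label reduces to -log of the true-class probability."""
    return -math.log(probs[true_index])


probs = softmax([2.0, 1.0, 0.0, -1.0])   # illustrative logits
predicted = CATEGORIES[probs.index(max(probs))]
print(predicted, cross_entropy(probs, true_index=0))
```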
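[0078] 3) Regression output of remaining shelf life days. Let the actual remaining shelf life be labeled $d^{*}$ (unit: days), and let the regression head output the predicted value $d$. To improve robustness, the Huber loss can be used: $L_{reg} = \tfrac{1}{2} r^{2}$ if $|r| \le \delta$, and $L_{reg} = \delta\left(|r| - \tfrac{1}{2}\delta\right)$ otherwise, where $r = d^{*} - d$ is the residual and $\delta$ is a preset threshold.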
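The Huber loss mentioned above has a standard form, quadratic for small residuals and linear for large ones, which can be sketched as follows. The default threshold of 1.0 day is an assumption for illustration; the specification only states that the threshold is preset.

```python
def huber_loss(true_days, predicted_days, delta=1.0):
    """Huber loss: quadratic within delta of the target, linear beyond it."""
    residual = abs(true_days - predicted_days)
    if residual <= delta:
        return 0.5 * residual ** 2
    return delta * (residual - 0.5 * delta)


print(huber_loss(5.0, 5.5))  # small residual, quadratic branch
print(huber_loss(5.0, 8.0))  # large residual, linear branch
```

The linear branch limits the influence of occasional large labeling errors in the remaining-days annotation.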
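[0079] 4) Discrete risk level classification output. Let the risk level set be $\mathcal{G}$ with Q levels (e.g., low / medium / high / extremely high). The risk probability is output through a fully connected layer: $\hat{u} = \mathrm{Softmax}(W_u e + b_u)$. The predicted risk level is: $\hat{g} = \arg\max_{q} \hat{u}_q$. Its training loss also uses cross-entropy: $L_{risk} = -\sum_{q=1}^{Q} u_q \log \hat{u}_q$, where $u$ is the one-hot annotation of the true risk level.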
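[0080] Alternatively, to enhance interpretability, the risk level can be obtained by threshold mapping from the remaining shelf life days in the regression output.
[0081] For example, with preset grading thresholds $\tau_1 < \tau_2 < \tau_3$: the risk level is extremely high if $d < \tau_1$, high if $\tau_1 \le d < \tau_2$, medium if $\tau_2 \le d < \tau_3$, and low if $d \ge \tau_3$.
[0082] Here d is the predicted number of remaining days, and $\tau_1$, $\tau_2$, $\tau_3$ are the preset grading thresholds.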
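The threshold mapping from predicted remaining days to a risk level can be sketched as follows. The first threshold of 2 days follows the example given in the embodiment (less than 2 days remaining is treated as extremely high risk); the other two thresholds, 5 and 10 days, are hypothetical values chosen only to make the example runnable.

```python
def risk_level(remaining_days, thresholds=(2.0, 5.0, 10.0)):
    """Map predicted remaining shelf-life days to a discrete risk level."""
    t_extreme, t_high, t_medium = thresholds  # 5.0 and 10.0 are illustrative
    if remaining_days < t_extreme:
        return "extremely high"
    if remaining_days < t_high:
        return "high"
    if remaining_days < t_medium:
        return "medium"
    return "low"


for days in (1.5, 3.0, 5.2, 12.0):
    print(days, risk_level(days))
```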
[0083] 5) Multi-task joint training objective. To optimize the three output tasks simultaneously, a weighted multi-task loss function is used: $L = \lambda_{cls} L_{cls} + \lambda_{reg} L_{reg} + \lambda_{risk} L_{risk}$, where $\lambda_{cls}$, $\lambda_{reg}$, and $\lambda_{risk}$ are preset weights used to balance the contributions of the classification, regression, and grading tasks.
[0084] Through the aforementioned multi-task output mechanism, it is possible to simultaneously obtain fruit and vegetable status classification results, shelf life remaining days prediction results, and risk level classification results based on time-series representation within the same model framework. This enables joint, quantitative, and graded assessment of fruit and vegetable quality status and shelf life risk, improving the accuracy and real-time performance of the assessment.
[0085] In this embodiment, melon (variety: thick-skinned melon) was used as the experimental subject.
[0086] 1.1 Sample pretreatment: Collect melons of uniform maturity (approximately 90% ripe) and without damage, and place them in a ventilated area to cool down for 12 hours. Select samples weighing 1.2kg-1.5kg with uniform surface color and intact stems.
[0087] 1.2 Experimental Grouping: Set the reference temperature T0 to 8℃. Set four temperature gradients: Group A: 0℃ (low temperature stress, prone to freezing damage). Group B: 4℃ (low temperature stress, prone to chilling injury). Group C: 8℃ (recommended storage temperature); Group D: 12℃ (higher temperature, simulating shelf environment).
[0088] Each group contains 50 samples, and each sample is labeled with a unique ID QR code.
[0089] 1.3 Sampling Definition: The sampling period was 24 hours, and the samples were stored for 32 days at 75-80% relative humidity. At each time point t, near-infrared images were acquired first, followed immediately by measurements of the physicochemical properties. The destructive measurements (hardness and enzyme activity) were performed on samples from the same batch.
[0090] 1.4 Image Acquisition Specifications: A near-infrared camera (center wavelength 840nm, half-bandwidth 10nm, wavelength 800nm~850nm, relative quantum efficiency 30%-40%) was used. The light source consisted of four symmetrically arranged near-infrared LEDs with a center wavelength of 840nm, a power of 34mW per LED, and an incident angle of 45 degrees. Reflectivity correction was performed using a standard white PTFE board.
[0091] Image preprocessing implementation steps (corresponding to Figures 2 and 3) 2.1 Noise Assessment and Denoising: As shown in Figure 2, the original NIR image is affected by thermal noise from the photosensitive device.
[0092] Step S1-1: Normalize the pixel values to [0, 1].
[0093] Step S1-3: Estimate the high-frequency noise level using the median absolute deviation (MAD). In this example, the noise standard deviation is also calculated over the background area.
[0094] Step S2-1: Set the NLM parameters, including the scaling factor. For group D (late stage of decay), which has higher noise, the search window radius is automatically increased to 15 pixels. The denoised image is shown in the middle panel of Figure 2: edge details are preserved and graininess is eliminated.
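A minimal sketch of MAD-based noise estimation and a noise-dependent search radius is given below. The factor 1.4826 is the standard constant that converts the MAD to a standard deviation under a Gaussian noise assumption. The radius rule, including the base radius of 7 and the gain of 100, is a hypothetical monotone mapping used only for illustration; only the upper value of 15 pixels appears in the embodiment.

```python
from statistics import median


def mad_sigma(values):
    """Robust noise standard deviation estimated from the median absolute deviation."""
    center = median(values)
    mad = median(abs(v - center) for v in values)
    return 1.4826 * mad  # Gaussian consistency constant


def search_radius(sigma, base=7, gain=100.0, max_radius=15):
    """Hypothetical rule: the radius grows with the noise level and is capped."""
    return min(max_radius, base + int(gain * sigma))


high_freq = [1, 2, 3, 4, 100]  # illustrative high-frequency responses with one outlier
print(mad_sigma(high_freq), search_radius(0.02), search_radius(0.2))
```

Because the estimate relies on medians, the single outlier in the example has no effect on the result, which is the reason MAD is preferred over the sample standard deviation for noise estimation.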
[0095] 2.2 Wallis Contrast Enhancement: To address uneven lighting on the melon surface (bright in the center and dark at the edges), Wallis enhancement is applied in the logarithmic domain.
[0096] Step S3-3: Set the target mean and the target standard deviation.
[0097] Effect: The enhanced image (right panel of Figure 2) clearly shows the reticulation changes on the surface of the melon. In particular, the tiny water-stain-like lesions at the initial stage of decay in Figure 3(d) become visible to the naked eye.
[0098] Feature extraction based on multi-agent sliding window (corresponding to Figures 4 to 7) The present invention constructs a collaborative sampling system composed of multiple agents.
[0099] 3.1 Agent interaction logic: Action space: The agent can move within the 1024×1024 image plane, and the exploration steps per round are 20.
[0100] Local observation: Each agent carries a 10×10 sliding window, as shown in Figure 4 . At time t, the agents lock the stem, side, and suspected defect areas of the melon respectively.
[0101] Neural network: Adopt the convolutional structure shown in Figure 5 . It contains 3 layers of Conv2d (convolution kernel 3x3) and MaxPool2d. The output local feature vector has a dimension of 256.
[0102] 3.2 Communication and decision-making: Belief update: As shown in Figure 6 , integrate historical sampling information through the LSTM module (hidden layer dimension 512).
[0103] Collaborative mechanism: The agents share the global aggregated features. If agent 1 discovers a suspected rotten point, the signal it emits guides agent 2 to move to the surrounding area for verification (see the window displacement from time t to time t+1 in Figure 4).
[0104] Training algorithm: The A2C (Advantage Actor-Critic) framework shown in Figure 7 is adopted. The reward function includes, among its terms, a recognition accuracy weight and a sampling coverage weight.
[0105] Multimodal fusion and state prediction 4.1 Data Alignment: After normalization, feature vectors are extracted from the physicochemical properties (hardness, weight loss rate, etc.) through a two-layer MLP. If data loss makes the physicochemical sampling frequency lower than the image sampling frequency, the nearest neighbor alignment algorithm is used to fill in the missing timestamps.
[0106] 4.2 Fusion Modeling: The gated fusion method is adopted. When obvious signs of deterioration (such as decay) appear in the image, the gate weights automatically tilt toward the image features; when external visual features are not obvious but internal hardness decreases sharply, the weights are biased toward the indicator features.
[0107] 4.3 Multi-task output: The system ultimately outputs results in three dimensions: 1. Classification: Output the probability that the melon belongs to "fresh, marketable, critical, or spoiled".
[0108] 2. Regression: Predicting remaining shelf life. For example, for a sample from group C on day 10, the system predicts a remaining shelf life of 5.2 days.
[0109] 3. Risk Level: Based on the regression value, the risk level is output according to the preset threshold (e.g., less than 2 days remaining is considered extremely high risk), providing decision-making suggestions for warehouse management.
[0110] Experimental Results Analysis The method of this invention achieves an accuracy of 95.9% in identifying the quality status of melons, which is 7.2 percentage points higher than a single global CNN model. In particular, when identifying early-stage chilling injury, the ability of the multi-agent sliding window to capture local micro-lesions provides a warning 24-48 hours in advance.
[0111] The above description is merely a specific embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in the present invention should be included within the scope of protection of the present invention.
Claims
1. A method for extracting features and predicting the state of near-infrared images of fruits and vegetables, characterized in that, The steps include: (1) collecting near-infrared image sequences of target fruits and vegetables under multiple temperature stress storage conditions and simultaneously collecting physicochemical index data, and establishing a unique identifier and corresponding timestamp for each sample; (2) The near-infrared image sequence is preprocessed, including noise estimation and adaptive denoising, and logarithmic domain Wallis contrast enhancement; (3) Extract time-series image features from the preprocessed image sequence based on the multi-agent sliding window collaborative sampling mechanism; (4) The time-series image features and the physicochemical index data are time-stamp aligned and multimodal fused to obtain a fused time-series representation; (5) Based on the fusion time series representation, the fruit and vegetable status classification results, shelf life remaining days regression results and risk level classification results are simultaneously output through the multi-task output module.
2. The method according to claim 1, characterized in that, The noise estimation and adaptive denoising in step (2) includes: 2.1) extracting the high-frequency response map H from the normalized image, estimating the high-frequency noise level based on the median absolute deviation, and estimating the background noise; 2.2) fusing the two estimates to obtain a noise level estimate, and adaptively setting the non-local means filter parameters based on that estimate, including the filter strength, a search window radius that increases monotonically with the noise level, and the patch radius; 2.3) denoising the image using non-local means filtering with the adaptive parameters.
3. The method according to claim 1, characterized in that, The logarithmic domain Wallis contrast enhancement in step (2) includes: 3.1) performing a logarithmic transformation on the denoised image to obtain a logarithmic domain image; 3.2) calculating the logarithmic domain mean and standard deviation within the local window; 3.3) performing the logarithmic domain Wallis transform using the local statistics together with the target mean and target standard deviation; 3.4) inversely transforming the enhancement result to the linear domain and applying dynamic range constraints.
4. The method according to claim 1, characterized in that, The multi-agent sliding window collaborative sampling in step (3) includes: setting up N agents on each frame of image, each agent acquiring image patches according to its local observation window; each agent updating its internal memory state through LSTM based on local observations and historical information, and outputting actions according to the policy network to adjust the window position and scale; using a shared convolutional neural network to extract the features of each agent's local observations, and aggregating all agent features through a communication fusion module to form a global image representation for each frame, which in turn constitutes the time-series image feature sequence.
5. The method according to claim 1, characterized in that, The multimodal fusion in step (4) includes: normalizing and extracting features from the physicochemical index data to obtain the index feature sequence; aligning the index feature sequence with the image feature sequence by timestamp; employing a gated fusion mechanism to fuse, at each time t, the image feature $g_t$ with the aligned indicator feature $\hat{p}_t$: $\alpha_t = \sigma\left(W_g [g_t; U\hat{p}_t] + b_g\right)$, $z_t = \alpha_t \odot g_t + (1 - \alpha_t) \odot U\hat{p}_t$, where $\sigma$ is the Sigmoid function, $\odot$ denotes element-wise multiplication, U is the dimension mapping matrix, $W_g$ and $b_g$ are the gate parameters, and $\alpha_t$ reflects the adaptive weights of the image modality and the index modality at that moment; the fused sequence $\{z_t\}_{t=1}^{T}$ is input to a temporal network, which may be an LSTM, to obtain a temporal representation, and the hidden states of the entire sequence are used as the temporal representation: $H = [h_1, h_2, \dots, h_T]$, where $h_t$ is the temporal hidden state at time t. H is the joint temporal representation of image information and physicochemical index information, which is used for subsequent classification, regression or anomaly detection tasks.
6. The method according to claim 1, characterized in that, The multi-task output module in step (5) includes: pooling the fused temporal representation to obtain a shared feature vector e; inputting e into the classification head, regression head and grading head respectively; the classification head outputs the probability distribution of the fruit and vegetable status categories; the regression head outputs the predicted value of the remaining shelf life days; the grading head outputs the probability distribution of the risk level, or obtains the risk level by threshold mapping based on the predicted value output by the regression head.
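7. The method according to claim 6, characterized in that, The loss function of the multi-task output module is a weighted multi-task loss: $L = \lambda_{cls} L_{cls} + \lambda_{reg} L_{reg} + \lambda_{risk} L_{risk}$, where $L_{cls}$ is the cross-entropy loss for state classification, $L_{reg}$ is the Huber loss for the regression of remaining days, $L_{risk}$ is the cross-entropy loss for risk level classification, and $\lambda_{cls}$, $\lambda_{reg}$, $\lambda_{risk}$ are preset weights.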
8. The method according to claim 1, characterized in that, The physicochemical indicators include at least two of the following: hardness, weight, respiration rate, and enzyme activity.
9. The method according to claim 1, characterized in that, In step (3), the training of the multi-agent system adopts a centralized training and decentralized decision-making framework, in which the global advantage function is calculated through a centralized value network and is used to update the policy network parameters of each agent.
10. The method according to claim 1, characterized in that, The temperature stress storage conditions include at least four temperature gradients set on the basis of the reference temperature, with relative humidity kept stable under each gradient, and the number of samples under each temperature condition meeting the requirements for model training, validation and testing.