A rapid detection method for capsanthin content based on multispectral data
By combining multispectral imaging technology with multivariate statistics and model residual analysis, the problem of rapid, accurate, and non-destructive detection of capsanthin content has been solved, enabling efficient and intelligent chili quality sorting and processing.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- BAODING UNIV
- Filing Date
- 2026-03-10
- Publication Date
- 2026-06-19
AI Technical Summary
Existing chemical detection methods are highly destructive, time-consuming, and labor-intensive in detecting capsanthin content, making it impossible to achieve high-throughput, rapid online screening. Furthermore, it is difficult to fully identify abnormal samples in spectral data, and the prediction models lack robustness and generalization ability, and there is a lack of assessment of the reliability of the prediction results.
Multispectral imaging technology was used to collect spectral data of chili pepper samples. Outliers were removed by a two-step method combining multivariate statistics and model residual analysis. Multiple machine learning regression models were constructed through cross-validation, and the preferred prediction model was selected. A weighted neighborhood consistency index was introduced to evaluate the reliability of the prediction.
This technology enables non-destructive and rapid detection of capsanthin content, improves detection efficiency and the robustness and accuracy of the prediction model, provides a reliability assessment of the prediction results, and enhances the decision support capabilities for practical applications.
Figure CN122238232A_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the field of non-destructive rapid detection and spectral analysis of crop components, and specifically relates to a rapid detection method for capsanthin content based on multispectral data.
Background Technology
[0002] Capsanthin, an important natural carotenoid pigment, is widely used in food processing, cosmetics, and pharmaceuticals. Its content directly affects the color, quality, and commercial value of chili peppers and their processed products. Traditionally, the determination of capsanthin content relies mainly on laboratory chemical analysis methods such as high-performance liquid chromatography (HPLC). While these methods are highly accurate, they typically require complex sample pretreatment (such as extraction and purification), which is destructive, time-consuming, labor-intensive, and costly, and they struggle to meet the need for rapid screening of large batches of samples in production settings.
[0003] With the development of optical sensing and spectral analysis technologies, non-destructive testing techniques such as near-infrared spectroscopy and hyperspectral imaging have shown potential in the quantitative analysis of agricultural product components. In particular, multispectral imaging technology can simultaneously acquire images and spectral information of samples in multiple specific spectral bands, possessing both spatial distribution and spectral feature analysis capabilities, making rapid and visual detection of the internal chemical components of chili peppers possible. However, due to the complex factors inherent in chili pepper samples, such as morphology, surface condition, and internal structural heterogeneity, as well as interference from environmental noise and equipment drift, directly and robustly reflecting capsanthin content from multispectral data remains a challenge. How to effectively remove outlier samples from the raw spectral data, construct robust prediction models, and conduct reliability assessments of prediction results for unknown samples are key aspects for the practical application of this method.
[0004] In recent years, machine learning and statistical modeling methods have been increasingly applied to spectral data analysis. By establishing nonlinear mapping relationships between spectral features and target chemical components, it is hoped that the prediction accuracy and generalization ability of detection models can be further improved. Therefore, researching a multispectral detection method that integrates high-quality data preprocessing, intelligent anomaly sample screening, multi-model optimization, and prediction reliability assessment is of significant research and application value for achieving rapid, accurate, and non-destructive detection of capsanthin content and improving the intelligent level of chili quality sorting and processing.
[0005] The following problems exist in the existing technology: While existing mainstream chemical detection methods (such as high-performance liquid chromatography) are highly accurate, they rely on complex sample pretreatment, which is destructive, time-consuming, and labor-intensive, and cannot achieve high-throughput, online rapid screening and sorting. When spectral technology is used for non-destructive testing, abnormal samples easily enter the raw spectral data because of individual sample differences, environmental noise, and equipment instability. Existing methods often use single or simple outlier removal strategies (such as those based on standard deviation thresholds), which struggle to comprehensively and accurately identify interference samples caused by spectral anomalies or inconsistent physicochemical properties, resulting in insufficient robustness and generalization ability of the constructed prediction models. Existing methods often rely on experience to pre-select a single type of regression model (such as partial least squares regression) when constructing predictive models. This approach fails to systematically evaluate and compare the performance of different types of algorithms on a specific dataset, which may prevent full use of the data's potential and make it difficult to achieve optimal predictive accuracy. Existing spectral prediction methods typically output only a single predicted value and lack a quantitative index for the reliability of the prediction result. When the spectral characteristics of the sample to be tested deviate significantly from the distribution of the modeling samples, the reliability of the predicted value is questionable, but existing technologies cannot give users this risk warning, which affects the reliability of decision-making in practical applications.
Summary of the Invention
[0006] The present invention aims to solve at least one of the technical problems existing in the prior art; to this end, the present invention proposes a rapid detection method for capsanthin content based on multispectral data to solve the above-mentioned technical problem.
[0007] The first aspect of this invention provides a rapid detection method for capsanthin content based on multispectral data, comprising the following steps:
S1: Collect spectral reflectance data of chili pepper samples in the visible and near-infrared bands using a multispectral imaging device to construct the original spectral dataset;
S2: Standardize the original spectral dataset, then remove outliers using a two-step method based on multivariate statistics and model residual analysis to obtain a sample set for modeling;
S3: Divide the modeling sample set into multiple training and test sets using cross-validation, construct multiple machine learning regression models on each training set, and select the preferred prediction model for detecting capsanthin content by comparing the prediction performance of each model on the corresponding test set;
S4: For the chili pepper sample to be tested, collect its spectral reflectance data with the multispectral imaging device, standardize the collected data and subject it to outlier removal to obtain standard spectral data, and input the standard spectral data into the preferred prediction model to obtain the quantitative prediction result of capsanthin content for the sample.
[0008] Preferably, step S1 includes the following steps: A hyperspectral imager is used as the multispectral imaging device. Before spectral reflectance data are collected from the chili pepper samples, the device is radiometrically calibrated using a standard whiteboard and a standard blackboard to establish the conversion relationship between the raw measurement signal and absolute reflectance. Based on this conversion relationship, the multispectral imaging device repeatedly collects spectral reflectance data from each chili pepper sample. The repeated measurements of each chili pepper sample are averaged to obtain its reflectance at multiple wavelengths, and the reflectance data of all chili pepper samples are assembled to form the original spectral dataset.
[0009] Preferably, step S2 includes the following steps: The reflectance data of each chili pepper sample in the original spectral dataset are standardized so that the mean of the data for each wavelength is 0 and the standard deviation is 1. Based on the standardized reflectance data, the first step of outlier removal based on multivariate statistical methods is performed by calculating the multivariate statistical outlier index for each chili pepper sample. Based on the chili sample data retained after the first step of outlier removal and the known reference values of capsanthin content, an initial prediction model is constructed. The residual is obtained based on the difference between the predicted value calculated by the initial prediction model and the actual reference value. The second step of outlier removal based on the model residual analysis method is then performed based on the residual. Finally, the remaining chili pepper samples after the first and second outlier removal steps constitute the sample set for modeling.
[0010] Preferably, the first step of outlier removal based on multivariate statistical methods includes the following steps: The multivariate statistical anomaly index is the interspectral correlation entropy (ISCE). Using the standardized reflectance data, the ISCE of each chili pepper sample is calculated using the following formula: [formula not reproduced]. Here $c$ is the chili pepper sample index; $M$ is the number of wavelengths; $r_{ij}$ is the Pearson correlation coefficient between wavelength points $i$ and $j$ across all chili pepper samples; the formula also uses the sum of the correlation terms over all wavelength pairs and a very small constant $\varepsilon$. Chili pepper samples whose ISCE value is higher than a preset threshold are removed.
[0011] Preferably, the second step of outlier removal based on the model residual analysis method includes the following steps: Using the chili pepper sample data retained after the first step of outlier removal and their known capsanthin reference values, a partial least squares regression model is constructed. This model predicts the capsanthin content of each chili pepper sample, and the predicted value is compared with the actual reference value to calculate the standardized prediction residual of each sample. Meanwhile, the Isolation Forest algorithm performs unsupervised anomaly detection on the reflectance data of the samples retained after the first step, yielding a standardized outlier score for each sample. From each sample's standardized prediction residual and standardized outlier score, its comprehensive modeling interference score is calculated: [formula not reproduced]. The formula uses two scaling correction parameters, one for the distribution of the standardized prediction residuals and one for the distribution of the standardized outlier scores, together with a preset cross-term coupling coefficient. Chili pepper samples whose comprehensive modeling interference score exceeds a preset unified threshold are identified as modeling interference samples and removed.
[0012] Preferably, step S3 includes the following steps: The five-fold cross-validation method is used to divide the modeling sample set into five mutually exclusive and equally distributed subsets. In a rotating manner, one subset is used as the test set in turn, and the other four subsets are combined as the training set, thus forming five different combinations of training and test sets. For each of the five training sets, four models were constructed: multiple linear regression model, partial least squares regression model, random forest regression model, and convolutional neural network regression model. Using the test set corresponding to each training set, the four regression models built on that training set were evaluated, and the average performance index of each regression model on the five test sets was obtained. By comparing the average performance index of the four regression models, the regression model type with the best average prediction performance was selected as the preferred prediction model for detecting capsanthin content.
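The five-fold rotation and model comparison described in step S3 can be illustrated with a minimal standard-library sketch. Everything named here is hypothetical: `k_fold_indices`, `select_model`, the use of RMSE as the "performance index" (the text does not name a metric), and the two toy models, which merely stand in for the four regression types listed above.

```python
import math
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Split sample indices into k disjoint folds; return (train, test) pairs."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::k] for f in range(k)]  # fold sizes differ by at most one
    splits = []
    for f in range(k):
        train = [i for g in range(k) if g != f for i in folds[g]]
        splits.append((train, folds[f]))
    return splits


def rmse(y_true, y_pred):
    """Root-mean-square error (assumed performance index)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))


def select_model(models, x, y, k=5, seed=0):
    """Average each model's test RMSE over the k folds and pick the lowest."""
    scores = {name: [] for name in models}
    for train, test in k_fold_indices(len(y), k, seed):
        x_tr, y_tr = [x[i] for i in train], [y[i] for i in train]
        x_te, y_te = [x[i] for i in test], [y[i] for i in test]
        for name, fit in models.items():
            predict = fit(x_tr, y_tr)  # fit returns a prediction function
            scores[name].append(rmse(y_te, [predict(row) for row in x_te]))
    mean_scores = {name: sum(v) / len(v) for name, v in scores.items()}
    return min(mean_scores, key=mean_scores.get), mean_scores


def fit_mean(x_train, y_train):
    """Toy baseline: always predict the training mean."""
    mu = sum(y_train) / len(y_train)
    return lambda row: mu


def fit_single_band(x_train, y_train):
    """Toy model: ordinary least squares on the first band only."""
    xs = [row[0] for row in x_train]
    mx, my = sum(xs) / len(xs), sum(y_train) / len(y_train)
    sxx = sum((a - mx) ** 2 for a in xs)
    slope = sum((a - mx) * (b - my) for a, b in zip(xs, y_train)) / sxx
    return lambda row: my + slope * (row[0] - mx)


# Synthetic data: content depends linearly on a single band.
x = [[i / 10.0] for i in range(20)]
y = [2.0 * row[0] + 1.0 for row in x]
best, mean_scores = select_model(
    {"mean": fit_mean, "single_band_ols": fit_single_band}, x, y
)
print(best, mean_scores)
```

In practice the dictionary passed to `select_model` would hold fitting routines for the multiple linear, partial least squares, random forest, and convolutional neural network regressors; the selection logic stays the same.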
[0013] Preferably, step S4 includes the following steps: For the chili pepper sample to be tested, its spectral reflectance data are acquired using the multispectral imaging device. The acquired data are standardized and subjected to outlier removal to obtain standard spectral data, which are input into the preferred prediction model to obtain a preliminary predicted value. Based on the standard spectral data and the modeling sample set, a neighborhood reliability assessment is performed and a weighted neighborhood consistency index (WNCI) is calculated. Finally, the quantitative prediction result of capsanthin content for the sample to be tested is obtained, comprising the preliminary predicted value and the WNCI.
[0014] Preferably, the calculation of the weighted neighborhood consistency index (WNCI) includes the following steps: In the spectral space of the modeling sample set, calculate the Mahalanobis distance between the standard spectral data and the spectral data of each sample in the modeling sample set, and select a fixed number of samples with the smallest distances as neighborhood samples. From the Mahalanobis distance between each neighborhood sample and the standard spectral data, calculate that sample's Gaussian weight [formula not reproduced], where the kernel width is a preset parameter. From the known capsanthin reference values of the neighborhood samples and their corresponding weights, calculate the weighted standard deviation [formula not reproduced]. Divide the value range of the capsanthin reference values of the neighborhood samples into three equal-width intervals, calculate the sum of the weights of the samples falling into each interval, and from these sums calculate the weighted spectral interference entropy (SIE) [formula not reproduced], which includes a very small smoothing constant. The WNCI is then obtained by combining the weighted standard deviation with the SIE [formula not reproduced]; this formula uses the global range of the capsanthin reference values of all chili pepper samples in the modeling sample set, together with preset regularization parameters.
[0015] Compared with the prior art, the beneficial effects of the present invention are: This invention is based on multispectral imaging technology, which can quickly acquire spectral information without destructive preprocessing of chili pepper samples. Combined with automated data processing and model prediction processes, it significantly improves detection efficiency and provides a feasible technical means for rapid screening and quality sorting of capsanthin content on the production line. This invention employs a two-step outlier removal strategy based on multivariate statistics (spectral correlation entropy) and model residual analysis. This strategy can more comprehensively and accurately identify and remove outlier samples from the perspective of spectral correlation and prediction residual synergy, thereby constructing a purer and higher-quality modeling sample set and fundamentally enhancing the robustness and reliability of the subsequent prediction model. This invention employs a cross-validation framework and systematically constructs and compares various machine learning regression models such as multiple linear regression, partial least squares regression, random forest, and convolutional neural networks. Through data-driven performance comparison, the "preferred prediction model" is selected, ensuring optimal fitting to specific data features, thereby achieving a more accurate and stable prediction capability for capsanthin content. In the final prediction stage, this invention not only outputs the predicted content value but also innovatively introduces the "Weighted Neighborhood Consistency Index (WNCI)" to quantify and evaluate the reliability of the prediction. 
This index integrates the statistical consistency and distribution disorder of the predicted value within the spectral neighborhood, providing users with an intuitive reference for prediction confidence, helping to identify high-risk prediction results, and enhancing the practical value and decision support capability of the entire method in real-world applications.
Attached Figure Description
[0016] Figure 1 is a schematic diagram of the method flow of the present invention.
Detailed Implementation
[0017] The technical solution of the present invention will be clearly and completely described below with reference to the embodiments. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments.
[0018] Referring to Figure 1, this invention provides a rapid detection method for capsanthin content based on multispectral data, comprising the following steps:
S1: Collect spectral reflectance data of chili pepper samples in the visible and near-infrared bands using a multispectral imaging device to construct the original spectral dataset;
S2: Standardize the original spectral dataset, then remove outliers using a two-step method based on multivariate statistics and model residual analysis to obtain a sample set for modeling;
S3: Divide the modeling sample set into multiple training and test sets using cross-validation, construct multiple machine learning regression models on each training set, and select the preferred prediction model for detecting capsanthin content by comparing the prediction performance of each model on the corresponding test set;
S4: For the chili pepper sample to be tested, collect its spectral reflectance data with the multispectral imaging device, standardize the collected data and subject it to outlier removal to obtain standard spectral data, and input the standard spectral data into the preferred prediction model to obtain the quantitative prediction result of capsanthin content for the sample.
[0019] Specifically, spectral reflectance data of chili pepper samples in the visible and near-infrared bands are first collected using a multispectral imaging device to construct a raw spectral dataset. Next, the raw spectral data are standardized, and a two-step strategy combining multivariate statistical analysis and model residual analysis is employed to identify and remove outliers, yielding a high-quality sample set for modeling. Then, cross-validation is used to divide the modeling sample set into multiple training and test sets. Various types of machine learning regression models are constructed on each training set, and the predictive performance of each model is systematically compared on the corresponding test sets. The model with the best performance for predicting capsanthin content is selected as the preferred prediction model. Finally, for unknown chili pepper samples, spectral data are collected using the same multispectral device. After standardization and outlier removal preprocessing consistent with the modeling stage, the resulting standardized spectral data are input into the previously selected preferred prediction model to obtain a quantitative prediction of the capsanthin content of the sample. The method thus covers the complete process from spectral data acquisition and cleaning through modeling to final prediction, aiming to achieve rapid, non-destructive, and accurate capsanthin content detection.
[0020] In one embodiment of the present invention, step S1 includes the following steps: A hyperspectral imager is used as the multispectral imaging device. Before spectral reflectance data are collected from the chili pepper samples, the device is radiometrically calibrated using a standard whiteboard and a standard blackboard to establish the conversion relationship between the raw measurement signal and absolute reflectance. Based on this conversion relationship, the multispectral imaging device repeatedly collects spectral reflectance data from each chili pepper sample. The repeated measurements of each chili pepper sample are averaged to obtain its reflectance at multiple wavelengths, and the reflectance data of all chili pepper samples are assembled to form the original spectral dataset.
[0021] Specifically, regarding the varieties and types of chili pepper samples, a wide range of chili pepper fruits should be collected, including but not limited to different cultivated varieties (such as long peppers, facing-heaven peppers, and sweet peppers), different maturity stages (such as green-ripe, color-changing, and fully ripe), and different origins. This diversity aims to ensure that the spectral characteristics of the sample set and the corresponding capsanthin content range can fully cover the variations that may be encountered in practical application scenarios. In terms of the physical morphology and preparation of chili pepper samples, the selected samples should be intact in appearance, free from mechanical damage, disease, or rot. Before collecting spectral data, the chili pepper samples need to undergo uniform pretreatment. Typical pretreatment includes cleaning the sample surface with a soft, damp cloth to remove dust and dirt, followed by natural air drying in a cool, ventilated place to eliminate the interference of surface moisture on spectral reflectance characteristics. Regarding the number of chili pepper samples, to ensure that the total number N of chili pepper samples for constructing the original spectral dataset reaches a sufficient scale, it should generally be no less than 200, preferably more than 500; for example, in one specific embodiment, a total of 1051 chili pepper samples from multiple varieties and different origins can be collected.
[0022] This method uses a hyperspectral imager as a multispectral imaging device to accurately capture the spectral reflectance characteristics of chili pepper samples. Hyperspectral imaging technology can simultaneously acquire spatial information and continuous spectral information of chili pepper samples. Compared with traditional multispectral devices, it has higher spectral resolution and can more meticulously reflect subtle spectral features related to capsanthin content. In specific implementation, the operating wavelength range of the selected hyperspectral imager should cover the visible to near-infrared region, for example, 400 nm to 1000 nm. The spectral resolution of the device can be set as needed, for example, better than 5 nm, while the spatial resolution is adapted to the sample size and detection accuracy requirements.
[0023] Before spectral reflectance data are acquired from the chili pepper samples, the hyperspectral imager undergoes radiometric calibration. This eliminates the influence of factors such as the device's own dark current and light source inhomogeneity, and converts the raw digital signal (DN value) acquired by the device into a physically meaningful absolute reflectance value. Dark reference acquisition (blackboard calibration): point the hyperspectral imager lens at a standard blackboard (usually a diffuse reflectance cavity or a specially made black cloth) with a spectral reflectance close to 0%, and acquire an image under exactly the same environment and parameter settings (such as integration time and aperture) as the sample acquisition to obtain the dark reference value $DN_{\text{dark}}$. In a 12-bit system, for example, the typical value is between 10 and 100; this data mainly contains the device's dark current noise. White reference acquisition (whiteboard calibration): under the same conditions, point the lens at a standard diffuse whiteboard with known high reflectivity (typically close to 99%) and acquire an image to obtain the white reference value $DN_{\text{white}}$. In a 12-bit system (range 0 to 4095), for example, the typical value range is 3200 to 3900, with a common setting of approximately 3800; ideally, the whiteboard reflects almost all the incident light. Reflectance conversion: for each subsequently acquired chili pepper sample, the raw value $DN_{\text{sample}}$ has a range determined by the bit depth of the camera's analog-to-digital converter (ADC), for example 0 to 4095 for a 12-bit ADC; the actual value depends on the sample's reflectance characteristics and the integration time. The absolute reflectance $R$ of each pixel in the image at a given wavelength is calculated as $R = \frac{DN_{\text{sample}} - DN_{\text{dark}}}{DN_{\text{white}} - DN_{\text{dark}}} \times R_{\text{white}}$, where $R_{\text{white}}$, the calibrated reflectance of the standard whiteboard at this wavelength, is a known constant close to 1. In the visible to near-infrared band (e.g., 400-1000 nm), the $R_{\text{white}}$ value of commercially available hyperspectral calibration whiteboards is typically between 0.97 and 1.00; for example, the nominal value for a common PTFE whiteboard is approximately 0.99.
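The dark/white reference conversion described above can be sketched in a few lines. This assumes the commonly used flat-field form R = (DN_sample - DN_dark) / (DN_white - DN_dark) * R_white; the function name `calibrate_reflectance` and the numeric values are illustrative only.

```python
def calibrate_reflectance(dn_sample, dn_dark, dn_white, r_white=0.99):
    """Convert one pixel's raw DN spectrum to absolute reflectance, band by band.

    dn_sample, dn_dark, dn_white: raw digital numbers per band.
    r_white: calibrated whiteboard reflectance (assumed constant across bands).
    """
    if not (len(dn_sample) == len(dn_dark) == len(dn_white)):
        raise ValueError("all spectra must have the same number of bands")
    reflectance = []
    for s, d, w in zip(dn_sample, dn_dark, dn_white):
        span = w - d
        if span <= 0:
            raise ValueError("white reference must exceed dark reference")
        # Subtract dark current, scale by the usable range, apply R_white.
        reflectance.append((s - d) / span * r_white)
    return reflectance


# Illustrative 12-bit readings for three bands.
dark = [50, 60, 55]
white = [3800, 3850, 3790]
sample = [1925, 1955, 3790]
print(calibrate_reflectance(sample, dark, white))
```

A sample reading equal to the white reference maps to `r_white`, and a reading equal to the dark reference maps to 0, which is a quick sanity check on any calibration run.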
[0024] To ensure the stability and representativeness of the collected data and to reduce the impact of random noise, spectral reflectance data are collected multiple times for each chili pepper sample. In practice, the number of acquisitions per sample, $K$, can be determined from the signal-to-noise ratio requirements and typically ranges from 3 to 5. During each acquisition, the sample can be slightly moved or rotated to obtain spectral information from different parts of the sample, improving the spatial representativeness of the data. After data acquisition is complete for all chili pepper samples, the $K$ radiometrically calibrated reflectance spectra of each individual sample (each spectrum being the reflectance values at a series of wavelengths) are averaged wavelength by wavelength (i.e., band by band). Let the reflectance of the $c$-th chili pepper sample at the $i$-th wavelength point in the $k$-th acquisition be $R_{c,i}^{(k)}$. The average reflectance of the sample at that wavelength, $\bar{R}_{c,i}$, is calculated as $\bar{R}_{c,i} = \frac{1}{K}\sum_{k=1}^{K} R_{c,i}^{(k)}$.
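The band-by-band averaging of repeated acquisitions is a plain arithmetic mean per wavelength. A minimal sketch, with the hypothetical helper `average_acquisitions` and made-up reflectance values:

```python
def average_acquisitions(acquisitions):
    """Average repeated reflectance spectra of one sample, band by band.

    acquisitions: list of spectra, each a list of reflectances
                  at the same wavelength points.
    """
    if not acquisitions:
        raise ValueError("need at least one acquisition")
    n_bands = len(acquisitions[0])
    if any(len(a) != n_bands for a in acquisitions):
        raise ValueError("all acquisitions must cover the same bands")
    k = len(acquisitions)
    return [sum(a[i] for a in acquisitions) / k for i in range(n_bands)]


# Three repeated acquisitions of one sample at two wavelengths.
repeats = [[0.20, 0.40], [0.40, 0.60], [0.60, 0.80]]
mean_spectrum = average_acquisitions(repeats)
print(mean_spectrum)

# Stacking the mean spectrum of every sample row by row gives the N x M dataset.
dataset = [average_acquisitions(r) for r in [repeats, repeats]]
```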
[0025] The final spectral data of all $N$ chili pepper samples are then assembled. The data for each chili pepper sample is a vector of the average reflectance at $M$ wavelengths, of the form $\mathbf{x}_c = [\bar{R}_{c,1}, \bar{R}_{c,2}, \ldots, \bar{R}_{c,M}]$. The vectors of all chili pepper samples are arranged row by row to form the original spectral dataset, which can be represented as an $N \times M$ matrix $X$. In this matrix, each row corresponds to a unique chili pepper sample and each column corresponds to a specific wavelength point (or spectral band); $X$ fully preserves the spectral characteristics of all chili pepper samples.
[0026] In one embodiment of the present invention, step S2 includes the following steps: The reflectance data of each chili pepper sample in the original spectral dataset are standardized so that the mean of the data for each wavelength is 0 and the standard deviation is 1. Based on the standardized reflectance data, the first step of outlier removal based on multivariate statistical methods is performed by calculating the multivariate statistical outlier index for each chili pepper sample. Based on the chili sample data retained after the first step of outlier removal and the known reference values of capsanthin content, an initial prediction model is constructed. The residual is obtained based on the difference between the predicted value calculated by the initial prediction model and the actual reference value. The second step of outlier removal based on the model residual analysis method is then performed based on the residual. Finally, the remaining chili pepper samples after the first and second outlier removal steps constitute the sample set for modeling.
[0027] Specifically, the original spectral dataset is standardized. This dataset is an $N \times M$ matrix $X$, where $N$ is the total number of chili pepper samples and $M$ is the number of wavelength points collected; the element $x_{ci}$ of the matrix represents the original reflectance value of the $c$-th sample at the $i$-th wavelength. The purpose of standardization is to eliminate the dimensional and scale effects caused by differences in instrument response or illumination intensity across different wavelength dimensions.
[0028] The processing is performed independently for each wavelength $i$. For the $i$-th wavelength, the arithmetic mean $\mu_i$ and the sample standard deviation $s_i$ of the reflectance values of all $N$ chili pepper samples at that wavelength are calculated as $\mu_i = \frac{1}{N}\sum_{c=1}^{N} x_{ci}$ and $s_i = \sqrt{\frac{1}{N-1}\sum_{c=1}^{N}(x_{ci}-\mu_i)^2}$, where $x_{ci}$ is the original reflectance of the $c$-th sample at wavelength $i$. Using $N-1$ as the denominator is the standard practice for estimating the sample standard deviation. Subsequently, the raw value $x_{ci}$ of each chili pepper sample $c$ at wavelength $i$ is transformed to obtain the standardized value $z_{ci} = \frac{x_{ci}-\mu_i}{s_i}$. After this transformation, for any wavelength $i$, the standardized data $\{z_{ci}\}_{c=1}^{N}$ of all chili pepper samples have a mean of 0 and a standard deviation of 1. After all $M$ wavelengths have been traversed, a standardized spectral data matrix $Z$ is generated, which still has dimension $N \times M$.
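The per-wavelength standardization can be sketched as below, using the N-1 denominator stated above. The function name `standardize_columns` is hypothetical, and returning the per-band means and standard deviations is an added design assumption so that a later test sample can be scaled with the same parameters as the modeling set.

```python
import math


def standardize_columns(x):
    """Z-score each column (wavelength) of an N x M list-of-lists matrix.

    Returns (z, means, stds); stds use the N-1 (sample) denominator.
    """
    n = len(x)
    if n < 2:
        raise ValueError("need at least two samples")
    m = len(x[0])
    means = [sum(row[i] for row in x) / n for i in range(m)]
    stds = [
        math.sqrt(sum((row[i] - means[i]) ** 2 for row in x) / (n - 1))
        for i in range(m)
    ]
    if any(s == 0 for s in stds):
        raise ValueError("a wavelength has zero variance and cannot be scaled")
    z = [[(row[i] - means[i]) / stds[i] for i in range(m)] for row in x]
    return z, means, stds


# Three samples, two wavelengths (illustrative values).
x = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
z, means, stds = standardize_columns(x)
print(z)
```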
[0029] The first step of outlier removal based on multivariate statistical methods is performed on a standardized spectral data matrix Z. Its goal is to identify and remove "spectral anomalies" that significantly deviate from the overall distribution of the main sample in the multivariate spectral space (i.e., a high-dimensional space composed of all M wavelength dimensions). This invention quantifies the degree of anomaly for each sample by calculating a specific multivariate statistical anomaly index. The core of this index lies in assessing whether there is a significant deviation between the overall correlation pattern of the spectral curve shape of each chili pepper sample and the curve shapes of other samples in the dataset. After calculating the multivariate statistical anomaly index values for all chili pepper samples, samples with index values higher than a preset threshold are judged as anomalies and removed. This step mainly filters out spectral anomalies caused by serious measurement errors, atypical samples, or contamination.
[0030] The second step of outlier removal based on model residual analysis is performed on the chili pepper sample data retained from the first step and their known capsanthin reference values. Its goal is to further identify and remove "modeling interference" samples that show no abnormality in the spectral space but whose relationship between spectral characteristics and actual capsanthin content differs significantly from the main pattern in the dataset. The specific steps are as follows: First, an initial prediction model is constructed using the retained chili pepper sample data, and the prediction residual of each sample is calculated. Simultaneously, an unsupervised anomaly detection algorithm is used to analyze the spectral data of the samples. Then, from the prediction residual and the unsupervised anomaly score, a comprehensive modeling interference score is calculated, which considers both the sample's error in the supervised prediction model and its anomaly from an unsupervised perspective. Finally, samples whose comprehensive score is higher than a preset unified threshold are identified as modeling interference samples and removed.
[0031] After the first step of outlier removal based on multivariate statistical methods and the second step of outlier removal based on model residual analysis, the remaining chili sample data and their corresponding capsanthin content reference values together constitute the modeling sample set used for subsequent model training, validation and screening.
[0032] In one embodiment of the present invention, the first step of outlier removal based on multivariate statistical methods includes the following steps: The multivariate statistical anomaly index is the interspectral correlation entropy (ISCE). Using the standardized reflectance data, the ISCE of each chili pepper sample, $\mathrm{ISCE}_c$, is calculated using the following formula: [formula missing] where $c$ is the chili pepper sample index; $M$ is the number of wavelengths; $\rho_{ij}$ is the Pearson correlation coefficient between wavelength points $i$ and $j$ across all chili pepper samples; $\sum_{i<j}|\rho_{ij}|$ is the sum of $|\rho_{ij}|$ over all wavelength pairs with $i<j$; and $\varepsilon$ is a very small constant. Chili pepper samples with an ISCE value higher than a preset threshold are removed.
[0033] Specifically, based on the standardized spectral data matrix Z (with dimensions N×M), the first step of outlier removal using multivariate statistical methods is performed. The core of this step is to calculate the interspectral correlation entropy for each chili pepper sample. As a multivariate statistical anomaly indicator, it is used to identify and remove spectral anomaly samples.
[0034] The interspectral correlation entropy $\mathrm{ISCE}_c$ is an index that quantifies how far the $c$-th chili pepper sample deviates from the overall spectral correlation pattern. Its calculation formula is: [formula missing] Here $c$ is the index of the chili pepper sample, ranging from 1 to $N$, where $N$ is the total number of samples. $M$ is the total number of wavelength points (or bands) collected, the value of which is determined by the spectral resolution and wavelength range of the hyperspectral imager; for example, when the instrument samples at 5-nanometer intervals in the 400-1000 nanometer range, $M$ is approximately 120; the typical value of $M$ is between tens and hundreds, depending on the equipment configuration. $\rho_{ij}$ is the Pearson correlation coefficient between wavelengths $i$ and $j$ across all $N$ chili pepper samples, calculated using the standardized reflectance data; it measures the linear correlation between the spectral responses at two wavelengths over the entire dataset, and $i$ and $j$ both range from 1 to $M$. It is calculated from the $i$-th and $j$-th column vectors of the standardized data matrix $Z$: $\rho_{ij} = \frac{\sum_{c=1}^{N}(z_{ci}-\bar{z}_i)(z_{cj}-\bar{z}_j)}{\sqrt{\sum_{c=1}^{N}(z_{ci}-\bar{z}_i)^2}\,\sqrt{\sum_{c=1}^{N}(z_{cj}-\bar{z}_j)^2}}$, where $\bar{z}_i$ and $\bar{z}_j$ are the mean values of the $i$-th and $j$-th columns, respectively. The range of $\rho_{ij}$ is [−1, 1]; the closer its absolute value $|\rho_{ij}|$ is to 1, the more consistent the spectral trends of the two wavelengths (positive or negative correlation); the closer to 0, the weaker the linear relationship. $\sum_{i<j}|\rho_{ij}|$ is the sum of $|\rho_{ij}|$ over all wavelength pairs; since $i$ ranges from 1 to $M-1$ and $j$ ranges from $i+1$ to $M$, there are $M(M-1)/2$ distinct wavelength pairs in total. This sum serves as a global normalization constant, ensuring that the weight terms $|\rho_{ij}| / \sum_{i<j}|\rho_{ij}|$ in the formula sum to 1 and thus form a probability distribution. $\varepsilon$ (the smoothing constant) is a very small positive number added inside the logarithm so that, when a weight term is extremely close to 0, the logarithm does not underflow numerically or tend towards negative infinity; its typical value is [value missing] or smaller.
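The normalized weight terms described above can be illustrated with a minimal sketch. It computes only the Pearson coefficients between wavelength columns and the weights $|\rho_{ij}| / \sum_{i<j}|\rho_{ij}|$; it does not compute the ISCE itself, whose formula is not reproduced in this text. The function names `pearson` and `pair_weights` and the toy matrix are hypothetical and serve illustration only.

```python
import math

def pearson(u, v):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)  # assumes neither column is constant

def pair_weights(Z):
    """Return {(i, j): |rho_ij| / sum of |rho|} for all wavelength pairs i < j.

    Z is a list of N sample rows, each holding M standardized reflectances.
    """
    M = len(Z[0])
    cols = [[row[i] for row in Z] for i in range(M)]
    abs_rho = {(i, j): abs(pearson(cols[i], cols[j]))
               for i in range(M) for j in range(i + 1, M)}
    total = sum(abs_rho.values())  # global normalization constant
    return {pair: r / total for pair, r in abs_rho.items()}

# Toy data: 4 samples x 3 wavelengths (made-up numbers, not patent data).
Z = [[0.1, 0.2, 0.9],
     [0.4, 0.5, 0.3],
     [0.7, 0.9, 0.8],
     [1.0, 1.1, 0.1]]
w = pair_weights(Z)
```

For $M$ wavelengths the dictionary holds $M(M-1)/2$ entries, and the weights sum to 1 up to floating-point rounding.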
[0035] After the $\mathrm{ISCE}_c$ values of all $N$ chili pepper samples have been calculated, a preset threshold needs to be established to identify anomalies. The purpose of the threshold is to distinguish normal samples from samples with spectral anomalies. A commonly used and robust method is based on the statistical distribution of the ISCE values: first, the mean $\mu_{\mathrm{ISCE}}$ and standard deviation $\sigma_{\mathrm{ISCE}}$ of the ISCE values of all samples are calculated; then the threshold is set as $T_{\mathrm{ISCE}} = \mu_{\mathrm{ISCE}} + k\,\sigma_{\mathrm{ISCE}}$, where $k$ is a positive multiplier factor. The value of $k$ determines the strictness of the rejection process and typically ranges from 2.5 to 3.5. For example, taking $k = 3$ roughly corresponds to the 99.7% confidence upper limit under a normal distribution, meaning that at most approximately 0.3% of extreme values may be eliminated. Another approach is the percentile method, which directly sets the threshold to a high percentile of all ISCE values, such as the 95th, 97.5th, or 99th percentile. This method is more direct and does not depend on the specific shape of the distribution. In practical applications, the threshold setting method can be chosen based on the size and requirements of the specific dataset. For example, for a dataset containing approximately 1000 samples, the threshold can be set to the 97.5th percentile of the ISCE values of all pepper samples arranged in ascending order. After the threshold $T_{\mathrm{ISCE}}$ is set, all samples are traversed, and chili pepper samples satisfying $\mathrm{ISCE}_c > T_{\mathrm{ISCE}}$ are identified as spectral anomalies and removed from the current dataset. These removed samples usually correspond to serious errors in the measurement process, physical defects in the sample itself (such as local rot or severe contamination), or varietal characteristics that differ greatly from the main dataset.
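The two threshold rules in this paragraph can be sketched as follows. This is an illustrative sketch under stated assumptions, not the patented implementation: the population standard deviation and linear interpolation between ranks are assumptions (the text specifies neither), and the function names are hypothetical.

```python
import statistics

def sigma_threshold(values, k=3.0):
    """Threshold = mean + k * standard deviation (population std assumed)."""
    return statistics.fmean(values) + k * statistics.pstdev(values)

def percentile_threshold(values, q=97.5):
    """q-th percentile of the values sorted in ascending order.

    Linear interpolation between neighbouring ranks is an assumption.
    """
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

def flag_anomalies(values, threshold):
    """Indices of samples whose index value exceeds the threshold."""
    return [c for c, v in enumerate(values) if v > threshold]

# Made-up index values: 19 ordinary samples and one extreme sample.
scores = [1.0] * 19 + [10.0]
removed = flag_anomalies(scores, sigma_threshold(scores, k=3.0))
```

The same helpers apply unchanged to any per-sample score that is thresholded in this way.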
[0036] After this step, a portion of the original N chili pepper samples is removed, and the remaining chili pepper samples constitute a data subset with more consistent spectral characteristics, laying the foundation for the subsequent second step of outlier removal based on model residuals.
[0037] In one embodiment of the present invention, the second step of outlier removal based on the model residual analysis method includes the following steps: Based on the chili sample data retained after the first step of outlier removal using multivariate statistical methods and their known capsanthin content reference values, a partial least squares regression model is constructed. This model predicts the capsanthin content of each chili sample, and the predicted value is compared with the actual reference value to calculate the standardized prediction residual $\tilde{e}_c$ of each chili sample. Meanwhile, the Isolation Forest algorithm is used to perform unsupervised anomaly detection on the reflectance data of the pepper samples retained after the first step, yielding the standardized outlier score $\tilde{s}_c$ of each pepper sample. Based on the standardized prediction residual $\tilde{e}_c$ and the standardized outlier score $\tilde{s}_c$ of each chili pepper sample, its comprehensive modeling interference score $S_c$ is calculated by the formula: [formula missing] where $\alpha$ and $\beta$ are the scale correction parameters for the distributions of the standardized prediction residuals and the standardized outlier scores, respectively, and $\lambda$ is the preset cross-term coupling coefficient. Chili pepper samples whose comprehensive modeling interference score $S_c$ exceeds a preset unified threshold are identified as modeling interference samples and removed.
[0038] Specifically, a second outlier removal step, based on model residual analysis, is performed on the chili pepper sample data retained after the first outlier removal step. The input includes the standardized spectral data submatrix $Z'$ (of dimension $N' \times M$, where $N'$ is the number of samples remaining after the first step of elimination and $M$ is the number of wavelengths) and the vector $\mathbf{y}$ of their known capsanthin content reference values.
[0039] The known capsanthin content reference value $y_c$ (where $c$ is the sample index, ranging from 1 to $N$) is the precise quantitative result obtained by destructive testing of the $c$-th chili pepper sample using standard chemical analysis methods. It is usually determined by high-performance liquid chromatography (HPLC): the chili pepper sample is extracted with an organic solvent (such as acetone), separated on a chromatographic column, and the characteristic absorption peak of capsanthin is detected at a specific wavelength (such as 460 nm or 472 nm); quantification is then performed by comparison with a standard. The result is expressed in milligrams of capsanthin per gram of dry or fresh weight (mg / g). For example, the reference value $y_c$ of a typical chili pepper sample ranges from 0.5 mg / g to 5.0 mg / g, with values reaching 10 mg / g or higher for high-pigment varieties. In one specific embodiment of the invention, the capsanthin content reference values of the 1051 initial samples used were determined by HPLC and ranged from approximately 0.8 mg / g to 8.3 mg / g.
[0040] Based on the data ($Z'$, $\mathbf{y}$), an initial prediction model is constructed, specifically a partial least squares regression (PLSR) model. The PLSR model extracts from the standardized spectral data submatrix $Z'$ the latent directions (called latent variables or principal components) that have the largest covariance with the known capsanthin content reference value vector $\mathbf{y}$, and establishes a linear regression relationship on them. The number of latent variables is denoted $L$, also often referred to as the number of principal components. $L$ is a key hyperparameter of the model, and its selection must balance model complexity against predictive power: too small an $L$ leads to underfitting, failing to capture sufficient spectral information; too large an $L$ leads to overfitting, reducing the robustness of predictions for unknown samples. The optimal $L$ is usually determined through internal cross-validation. Specifically, the training set ($Z'$, $\mathbf{y}$) is randomly divided into multiple subsets (e.g., 10 subsets); each subset is used in turn as the validation set, and the rest are used as the training subsets. PLSR models are constructed with different $L$ values (e.g., from 1 to 20), and the prediction error (e.g., root mean square error) of each model on the validation set is calculated. Finally, the $L$ value at which the average prediction error is minimized or reaches a plateau is selected as the optimal value. In one embodiment, the optimal number of principal components $L$ determined through this process is 6.
[0041] Using the determined optimal number of principal components $L$ and the chili sample data ($Z'$, $\mathbf{y}$), the partial least squares regression model is trained using a standard algorithm such as the nonlinear iterative partial least squares (NIPALS) algorithm, ultimately yielding model parameters including a weight matrix, a loading matrix, and a regression coefficient vector $\mathbf{b}$. The weight matrix defines the contribution weights of the original spectral variables to the latent variables, while the loading matrix describes the relationship between the latent variables and the original spectral variables. The regression coefficient vector $\mathbf{b}$ is an $M$-dimensional vector (i.e., $\mathbf{b} \in \mathbb{R}^{M}$) that directly establishes a linear mapping between the standardized spectral data and the predicted content. Each element $b_i$ of $\mathbf{b}$ corresponds to the $i$-th wavelength, and its value represents the contribution weight of the standardized reflectance at that wavelength to the final predicted capsanthin content. The specific value of $\mathbf{b}$ is obtained during model training using a least-squares optimization algorithm, with the aim of minimizing the sum of squared errors between the predicted and actual reference values. Since the input spectral data has been standardized (mean 0, standard deviation 1), the elements $b_i$ usually have no absolute limits on their range, but their magnitude directly reflects the importance of the corresponding wavelength feature. In an actually trained model, typical values of $b_i$ may range from −10 to 10, depending on the strength of the correlation between the wavelength and the capsanthin content; for example, in a model built for a specific variety, the coefficient $b_i$ of a wavelength associated with the characteristic absorption of capsanthin (such as around 550 nm) is 0.85, while the coefficient of an unrelated wavelength is close to 0.
The trained PLSR model is used to predict the capsanthin content of each chili pepper sample $c$ used for modeling. The prediction takes the standardized spectral row vector $\mathbf{z}_c$ of the sample (i.e., the $c$-th row of $Z'$) as the model input, and the calculation formula is: $\hat{y}_c = \mathbf{z}_c \mathbf{b} + b_0$, where $\hat{y}_c$ is the predicted content and $b_0$ is the intercept term of the PLSR model. The intercept $b_0$ is a scalar constant whose value is determined together with the regression coefficient vector $\mathbf{b}$ during model training, with the aim of minimizing the prediction residual; its numerical value matches the order of magnitude of the known capsanthin content reference values. For example, when the known capsanthin content reference values are in mg / g and range from 0.5 to 5.0, the typical range of $b_0$ is likely between −2 and 2. In this way, the predicted value vector $\hat{\mathbf{y}}$ of all samples is obtained.
[0042] The raw prediction residual of each chili pepper sample is calculated as $e_c = y_c - \hat{y}_c$. All raw residuals are then standardized to remove their dimension (units): first, the mean $\mu_e$ and standard deviation $\sigma_e$ of the residuals are calculated; then the standardized prediction residual of each sample is calculated as $\tilde{e}_c = (e_c - \mu_e) / \sigma_e$. After standardization, the absolute value $|\tilde{e}_c|$ directly reflects how far the prediction error of the sample deviates from the error distribution of all samples.
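The linear prediction and the residual standardization described above can be sketched in a few lines. The sketch assumes that a coefficient vector and intercept are already available from a fitted model (fitting the PLSR itself is not shown), takes the residual as reference minus prediction, and uses the population standard deviation; all three are assumptions, and the function names are hypothetical.

```python
import statistics

def predict(z_row, b, b0):
    """Linear prediction: dot product of one standardized spectrum with b, plus b0."""
    return sum(z * w for z, w in zip(z_row, b)) + b0

def standardized_residuals(y_true, y_pred):
    """Standardize raw residuals (reference minus prediction) to mean 0, std 1."""
    raw = [yt - yp for yt, yp in zip(y_true, y_pred)]
    mu = statistics.fmean(raw)
    sd = statistics.pstdev(raw)  # population std is an assumption
    return [(e - mu) / sd for e in raw]

# Made-up reference values and predictions in mg/g, for illustration only.
y_ref = [1.2, 2.5, 3.1, 4.0, 0.9]
y_hat = [1.0, 2.6, 3.0, 5.5, 1.0]
res = standardized_residuals(y_ref, y_hat)
```

In this toy example the fourth sample, whose prediction is off by 1.5 mg / g, receives by far the largest absolute standardized residual.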
[0043] Simultaneously, the Isolation Forest algorithm is applied to the spectral data matrix $Z'$ for unsupervised anomaly detection. The Isolation Forest algorithm detects anomalies by constructing multiple isolation trees. For a dataset containing $N'$ samples, each isolation tree is constructed as follows: a subset of samples is randomly selected (the subsample size $\psi$ is typically set to 256 or less); then a feature (i.e., a wavelength) and a random split value are selected recursively to partition the samples into the left or right subtree, until the set tree depth limit is reached or the sample is isolated (i.e., the sample point is alone in a node). For any given sample $\mathbf{z}_c$ (i.e., the $c$-th row of $Z'$), its path length $h(\mathbf{z}_c)$ in an isolation tree is defined as the number of edges traversed from the root node to the node where the sample is isolated. By constructing $t$ such isolation trees ($t$ is the preset number of trees, usually set to 100), the raw outlier score $s_c$ of the sample is calculated from its average path length over these trees: the average path length $E[h(\mathbf{z}_c)]$ of sample $\mathbf{z}_c$ over the $t$ isolation trees is computed, and the raw outlier score is $s_c = 2^{-E[h(\mathbf{z}_c)] / C(\psi)}$, where $C(\psi)$ is the normalization term used to standardize the path length. It is the average path length of an unsuccessful search in a binary search tree built on $\psi$ samples, calculated as $C(\psi) = 2H(\psi - 1) - 2(\psi - 1)/\psi$, where $H(i)$ is the harmonic number, which can be estimated as $H(i) \approx \ln(i) + \gamma$ ($\gamma$ is Euler's constant, approximately 0.5772). According to this formula, $s_c$ ranges from 0 to 1; the closer the score is to 1, the shorter the average path length needed to isolate the sample and the higher the probability that it is an anomaly; the closer the score is to 0, the more likely the sample is normal. The raw outlier scores are then standardized: their mean $\mu_s$ and standard deviation $\sigma_s$ are calculated, and the standardized outlier score is obtained as $\tilde{s}_c = (s_c - \mu_s) / \sigma_s$.
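The normalization term and the raw outlier score of the Isolation Forest can be sketched as follows. The sketch only converts an already known average path length into a score; building the isolation trees is not shown. The function names are hypothetical, and the special cases for subsample sizes of 1 and 2 follow the usual convention for this algorithm rather than anything stated in this text.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler's constant

def c_factor(n):
    """Average path length of an unsuccessful BST search over n samples."""
    if n <= 1:
        return 0.0  # conventional special case, not stated in the text
    if n == 2:
        return 1.0  # conventional special case, not stated in the text
    harmonic = math.log(n - 1) + EULER_GAMMA  # H(n-1) ~ ln(n-1) + gamma
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def outlier_score(mean_path_length, psi=256):
    """Raw outlier score 2 ** (-E[h] / C(psi)); requires psi >= 2."""
    return 2.0 ** (-mean_path_length / c_factor(psi))

# A sample isolated after very few splits scores close to 1 (likely anomaly);
# one whose path length equals C(psi) scores exactly 0.5.
short_path = outlier_score(2.0)
typical_path = outlier_score(c_factor(256))
```

For a subsample size of 256 the normalization term is about 10.24, so path lengths well below that value push the score towards 1.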
[0044] Based on the standardized prediction residual $\tilde{e}_c$ and the standardized outlier score $\tilde{s}_c$ of each chili pepper sample, its comprehensive modeling interference score $S_c$ is calculated by the formula: [formula missing] Here, the scale correction parameters $\alpha$ and $\beta$ adjust the relative weights of $\tilde{e}_c$ and $\tilde{s}_c$ in the overall score and can be set to 1 by default; if the prediction residual is considered the more reliable indicator of anomalies, $\alpha$ can be set to a value less than 1 (e.g., 0.8) to increase its contribution, and vice versa; typical values range from 0.5 to 2.0. The cross-term coupling coefficient $\lambda$ controls the interaction between the residual and the outlier score, with values in [−1, 1]: when $\lambda > 0$, the synergistic effect is amplified when $\tilde{e}_c$ and $\tilde{s}_c$ have the same sign (i.e., both indicate abnormality or both indicate normality); when $\lambda < 0$, this synergistic effect is suppressed. It is usually set to 0 (no interaction) or a small positive value (such as 0.2) to account moderately for synergy.
[0045] After the comprehensive modeling interference scores $S_c$ of all pepper samples have been calculated, a unified discrimination threshold $T_S$ is set. This threshold can be determined statistically, for example by calculating the mean $\mu_S$ and standard deviation $\sigma_S$ of all $S_c$ and letting $T_S = \mu_S + k_S\,\sigma_S$, where $k_S$ typically takes a value between 2.0 and 3.0 (e.g., 2.5); alternatively, a percentile can be used directly, such as the 95th percentile. All chili pepper samples satisfying $S_c > T_S$ are identified as modeling interference samples and removed from ($Z'$, $\mathbf{y}$), so that they are excluded from the final modeling sample set. These samples usually show no obvious outliers in their spectra, but their spectrum-content relationships deviate significantly from the main pattern, or their reference values have potential problems. After this second step of anomaly removal, the remaining chili pepper samples constitute a highly pure final modeling sample set.
[0046] In one embodiment of the present invention, step S3 includes the following steps: The five-fold cross-validation method is used to divide the modeling sample set into five mutually exclusive and equally distributed subsets. In a rotating manner, one subset is used as the test set in turn, and the other four subsets are combined as the training set, thus forming five different combinations of training and test sets. For each of the five training sets, four models were constructed: multiple linear regression model, partial least squares regression model, random forest regression model, and convolutional neural network regression model. Using the test set corresponding to each training set, the four regression models built on that training set were evaluated, and the average performance index of each regression model on the five test sets was obtained. By comparing the average performance index of the four regression models, the regression model type with the best average prediction performance was selected as the preferred prediction model for detecting capsanthin content.
[0047] Specifically, the modeling sample set includes the standardized spectral data matrix $Z_f$ of the $N_f$ finally retained chili pepper samples (of dimension $N_f \times M$) and the corresponding vector $\mathbf{y}_f$ of known capsanthin content reference values.
[0048] First, a five-fold cross-validation method is used to divide the modeling sample set into subsets for training and validation. The goal of this division is to ensure that the subsets are balanced in sample size and representative in data distribution. Specifically, the $N_f$ chili pepper samples are randomly shuffled and then divided sequentially and approximately equally into five mutually exclusive subsets, denoted $D_1, D_2, D_3, D_4, D_5$. The numbers of samples in the subsets are kept as close as possible, usually $\lfloor N_f / 5 \rfloor$ or $\lceil N_f / 5 \rceil$. For example, when $N_f$ is 919, four of the subsets contain 184 samples each, and one subset contains 183 samples.
[0049] In each round of the five-fold cross-validation, one subset is designated as the test set, while the remaining four subsets are combined to form the training set. This results in five different combinations of training and test sets: in round $k$ ($k = 1, \dots, 5$), the test set is $D_k$ and the training set is the union of the other four subsets. For each combination, the training set is used to build the machine learning regression models (prediction models), while the corresponding test set is used to evaluate their predictive performance, ensuring the fairness of the evaluation. For each of the five training sets, four types of regression models are built independently: a multiple linear regression model, a partial least squares regression model, a random forest regression model, and a convolutional neural network regression model.
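The five-fold partition and the rotation of test sets can be sketched as follows. This is an illustrative sketch: the fixed random seed, the function names, and the choice of which subsets receive the extra sample are assumptions.

```python
import random

def five_fold_indices(n_samples, n_folds=5, seed=0):
    """Shuffle sample indices and cut them into nearly equal, disjoint folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    base, extra = divmod(n_samples, n_folds)
    folds, start = [], 0
    for k in range(n_folds):
        size = base + (1 if k < extra else 0)  # first `extra` folds get one more
        folds.append(idx[start:start + size])
        start += size
    return folds

def train_test_rounds(folds):
    """Yield (train, test) index lists; each fold is the test set exactly once."""
    for k, test in enumerate(folds):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test

folds = five_fold_indices(919)
fold_sizes = [len(f) for f in folds]
rounds = list(train_test_rounds(folds))
```

With 919 samples this yields four folds of 184 samples and one of 183, matching the example in the text.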
[0050] The multiple linear regression model assumes a linear relationship between the capsanthin content $y$ and the standardized reflectance at the $M$ wavelengths, $z_1, z_2, \dots, z_M$. Its model form is: $y = \beta_0 + \sum_{i=1}^{M} \beta_i z_i + \epsilon$, where $\beta_0$ is the intercept, $\beta_i$ are the regression coefficients, and $\epsilon$ is the error term. Using the training set data (with $N_{tr}$ denoting the number of training set samples), the regression coefficients are solved by ordinary least squares, i.e., by minimizing the sum of squared residuals $\sum_{r=1}^{N_{tr}} (y_r - \hat{y}_r)^2$, where $y_r$ is the known capsanthin content reference value of the $r$-th chili pepper sample in the training set and $\hat{y}_r$ is the model's predicted value for that sample. Since the spectral data dimension $M$ may be high and there may be collinearity among the variables, regularization methods (such as ridge regression) can be used in practice to enhance the stability of the model, but the basic model remains the multiple linear form above.
[0051] The partial least squares regression model establishes the regression relationship by extracting the latent variables (principal components) with the largest covariance between the spectral data and the known capsanthin content reference values. For the training set, let its spectral data matrix be $Z_{tr}$ (of dimension $N_{tr} \times M$) and its known capsanthin content reference value vector be $\mathbf{y}_{tr}$ (of dimension $N_{tr} \times 1$). The model extracts $L$ latent variables, where $L$ is the key hyperparameter; the optimal value of $L$ can be determined by cross-validation within the training set, with a typical search range of 1 to 20; in one embodiment, the determined optimal value of $L$ is 6. The final form of the partial least squares regression model is: $\hat{\mathbf{y}}_{tr} = Z_{tr}\,\mathbf{b} + b_0$, where $\mathbf{b}$ is the regression coefficient vector, $b_0$ is the intercept, and $\hat{\mathbf{y}}_{tr}$ is the vector of values predicted by the model for the training set samples.
[0052] Random forest regression is an ensemble learning method that builds multiple decision trees and combines their predictions (e.g., by averaging) to perform regression. Each decision tree is trained on a bootstrap sample of the training set, and only a random subset of the features (wavelengths) is considered at each node split. Key hyperparameters include the number of decision trees $n_{\mathrm{tree}}$, the maximum depth of each tree $d_{\max}$, and the number of features considered when splitting a node $m_{\mathrm{try}}$. The typical range of $n_{\mathrm{tree}}$ is 100 to 500, for example 300; $d_{\max}$ can be left unlimited or set to a large value such as 20; $m_{\mathrm{try}}$ is usually set to [value missing] or [value missing]. Each tree in the random forest regression model is grown by minimizing its prediction error.
[0053] The convolutional neural network (CNN) regression model uses a one-dimensional CNN to process the spectral data sequence. The network input is the $M$-dimensional standardized spectral vector of a single sample. A typical network structure includes one-dimensional convolutional layers, activation function layers, pooling layers, a flattening layer, and fully connected layers. The one-dimensional convolutional layers use multiple one-dimensional kernels to perform convolution along the spectral dimension to extract local spectral features; the typical number of kernels is 32 or 64, and the typical kernel size (receptive field) is 3 or 5. The activation function layers apply a non-linear activation function, such as ReLU (Rectified Linear Unit), to the convolutional output. The pooling layers, which are optional, use one-dimensional max pooling or average pooling for downsampling, with a typical pooling size of 2. The flattening layer flattens the multi-dimensional feature map into a one-dimensional vector. The fully connected part typically consists of one or more fully connected layers and finally outputs a scalar as the predicted capsanthin content. The network is trained using the backpropagation algorithm and an optimizer (such as Adam), and the loss function is usually the mean squared error; the typical number of training epochs is 50 to 200, and the batch size can be 16 or 32.
[0054] For each of the five training-test set combinations, the four models built on the training set are evaluated independently on the corresponding test set. Let the number of samples in the test set be $N_{te}$, the known capsanthin content reference values of the test set samples be $y_r$, and the model's predicted values for the test set be $\hat{y}_r$ ($r = 1, \dots, N_{te}$). The evaluation calculates performance metrics for each model on the test set, primarily the coefficient of determination $R^2$ and the root mean square error (RMSE). The coefficient of determination is calculated as $R^2 = 1 - \frac{\sum_{r=1}^{N_{te}} (y_r - \hat{y}_r)^2}{\sum_{r=1}^{N_{te}} (y_r - \bar{y})^2}$, where $\bar{y}$ is the mean of the true content in the test set; the closer $R^2$ is to 1, the stronger the model's ability to explain variance. The root mean square error is calculated as $\mathrm{RMSE} = \sqrt{\frac{1}{N_{te}} \sum_{r=1}^{N_{te}} (y_r - \hat{y}_r)^2}$; it has the same units as the capsanthin content (e.g., mg / g), and the smaller the value, the higher the prediction accuracy. For each type of regression model, the arithmetic mean of its $R^2$ values, $\overline{R^2}$, and the arithmetic mean of its RMSE values, $\overline{\mathrm{RMSE}}$, over the five test sets are calculated.
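The two evaluation metrics and their averaging over folds can be sketched as follows. The function names and the toy numbers are hypothetical and are not results from the patent.

```python
import math

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    return 1.0 - ss_res / ss_tot

def rmse(y_true, y_pred):
    """Root mean square error, in the same units as the reference values."""
    n = len(y_true)
    return math.sqrt(sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)) / n)

def mean_over_folds(per_fold_values):
    """Arithmetic mean of a metric over the cross-validation test sets."""
    return sum(per_fold_values) / len(per_fold_values)

# Made-up test-set values in mg/g: every prediction is 0.5 too high.
y_ref = [1.0, 2.0, 3.0, 4.0]
y_hat = [1.5, 2.5, 3.5, 4.5]
fold_r2, fold_rmse = r2_score(y_ref, y_hat), rmse(y_ref, y_hat)
```

In this toy case the RMSE is 0.5 mg / g and $R^2$ is 0.8, since the residual sum of squares is 1 and the total sum of squares is 5.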
[0055] The preferred prediction model is selected by comparing the average performance metrics of the four regression model types. The selection criterion is typically the model type with the highest $\overline{R^2}$ and the lowest $\overline{\mathrm{RMSE}}$. If the metrics conflict (e.g., a model has the highest $\overline{R^2}$ but not the lowest $\overline{\mathrm{RMSE}}$), the selection can be made according to whether the practical application emphasizes accuracy or error, or by combining both indicators (for example, a normalized difference of $\overline{R^2}$ and $\overline{\mathrm{RMSE}}$). The regression model type with the best average predictive performance is ultimately selected as the preferred prediction model for detecting capsanthin content. For example, in one specific embodiment, the random forest regression model obtained an average $\overline{R^2}$ = 0.96 and an average $\overline{\mathrm{RMSE}}$ = 0.02 mg / g in five-fold cross-validation, which is superior to the other three models, and it was therefore selected as the preferred prediction model. The type of this preferred prediction model and the final model parameters obtained by retraining on the entire modeling sample set are used for the subsequent quantitative prediction of capsanthin content in the samples to be tested.
[0056] In one embodiment of the present invention, step S4 includes the following steps: For the chili pepper sample to be tested, its spectral reflectance data is acquired using a multispectral imaging device. The acquired spectral reflectance data is standardized and subjected to outlier detection to obtain standard spectral data. The standard spectral data is input into the preferred prediction model to obtain a preliminary predicted value. Based on the standard spectral data and the modeling sample set, a neighborhood reliability assessment is performed and a weighted neighborhood consistency index (WNCI) is calculated. Finally, the quantitative prediction result of the capsanthin content of the pepper sample to be tested is obtained, comprising the preliminary predicted value and the weighted neighborhood consistency index (WNCI).
[0057] Specifically, based on a defined preferred prediction model and modeling sample set, rapid, non-destructive quantitative detection is performed on new, unknown-content chili pepper samples.
[0058] Multispectral imaging equipment is used to acquire multispectral data from the pepper sample under test. The equipment needs to operate under ambient lighting conditions similar to those used when acquiring the sample set for modeling. To ensure data quality, radiometric calibration is performed before acquisition, and the sample is acquired repeatedly to calculate the average spectrum. Let the raw reflectance vector of the sample after acquisition, radiometric calibration, and averaging be $\mathbf{x}^{*} = (x^{*}_1, \dots, x^{*}_M)$, where $M$ is the number of wavelengths. This raw spectral data is then standardized; the mean vector and standard deviation vector used for standardization are the parameters calculated and stored during the standardization of the original modeling spectral dataset in claim 3. For the raw reflectance $x^{*}_i$ of the sample under test at the $i$-th wavelength, the standardized value $z^{*}_i$ is calculated as $z^{*}_i = (x^{*}_i - \mu_i) / \sigma_i$, where $\mu_i$ and $\sigma_i$ are the stored mean and standard deviation at the $i$-th wavelength. After processing, the standardized spectral vector $\mathbf{z}^{*}$ of the sample to be tested is obtained.
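The reuse of stored standardization parameters can be sketched as follows: the per-wavelength mean and standard deviation are computed once on the modeling spectra and then applied unchanged to a new sample. The population standard deviation, the function names, and the toy reflectances are assumptions for illustration.

```python
import statistics

def fit_standardizer(X):
    """Per-wavelength mean and standard deviation of the modeling spectra X.

    X is a list of sample rows; population std is an assumption.
    """
    cols = list(zip(*X))
    means = [statistics.fmean(c) for c in cols]
    stds = [statistics.pstdev(c) for c in cols]
    return means, stds

def standardize(x_raw, means, stds):
    """Apply stored parameters to one raw reflectance vector."""
    return [(x - m) / s for x, m, s in zip(x_raw, means, stds)]

# Made-up modeling spectra: 2 samples x 2 wavelengths.
X_model = [[0.2, 0.4],
           [0.4, 0.8]]
means, stds = fit_standardizer(X_model)

# A new sample is standardized with the stored parameters only;
# its own statistics are never used.
z_new = standardize([0.4, 0.8], means, stds)
```

Keeping the parameters fixed ensures that the sample under test is expressed on the same scale as the data on which the prediction model was trained.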
[0059] Outlier detection is performed on the standardized spectral data $\mathbf{z}^{*}$ of the pepper sample to be tested, to identify whether it is an extremely anomalous sample, using the outlier removal models established in claims 4 and 5 as the criteria. For spectral anomaly detection, the interspectral correlation entropy $\mathrm{ISCE}^{*}$ of the sample to be tested is calculated (by the same method as in claim 4); if $\mathrm{ISCE}^{*}$ is greater than the spectral anomaly threshold set in claim 4, the sample is judged to have a spectral anomaly. For the modeling interference check, $\mathbf{z}^{*}$ is substituted into the initial partial least squares regression model and the Isolation Forest model trained in claim 5 to calculate its standardized prediction residual $\tilde{e}^{*}$ and standardized outlier score $\tilde{s}^{*}$, from which its comprehensive modeling interference score $S^{*}$ is calculated (by the same method as in claim 5); if $S^{*}$ is greater than the modeling interference threshold set in claim 5, the sample is judged to be a modeling interference sample. If the sample is not identified as abnormal in either of the two checks, its standardized spectral vector is deemed valid standard spectral data and proceeds to the next prediction step. If an anomaly is detected, the system issues a warning indicating that the test result may be unreliable.
[0060] The standard spectral data $\mathbf{z}^{*}$ is input into the preferred prediction model selected in claim 6. The model performs forward computation according to its specific type (e.g., multiple linear regression, partial least squares regression, random forest regression, or convolutional neural network regression), and outputs a preliminary predicted value of the capsanthin content of the sample to be tested, denoted $\hat{y}^{*}$. Its unit is consistent with the unit of the content reference values used in modeling, such as milligrams per gram of fresh weight (mg / g).
[0061] To provide users with a quantitative reference for the reliability of the preliminary predicted value $\hat{y}^{*}$, a neighborhood reliability assessment is performed based on the aforementioned standard spectral data $\mathbf{z}^{*}$ and the modeling sample set. This assessment calculates a Weighted Neighborhood Consistency Index (WNCI). The WNCI is a value between 0 and 1 that characterizes the concentration and consistency of the capsanthin content distribution among the known samples within the "neighborhood" of the tested chili pepper sample in the spectral feature space. A higher WNCI value indicates a more stable prediction environment and a higher credibility of the preliminary predicted value $\hat{y}^{*}$.
[0062] Finally, the quantitative prediction result of the capsanthin content of the chili pepper sample is output as a pair of two elements, ($\hat{y}^{*}$, WNCI), in which $\hat{y}^{*}$ is the preliminary quantitative prediction of capsanthin content and WNCI is its reliability index. Users can judge the reliability of the prediction result based on the WNCI value. For example, when WNCI > 0.7, the prediction result can be considered highly reliable, while when WNCI < 0.3, the result should be treated with caution or the sample and measurement process should be re-examined.
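The example thresholds above can be expressed as a small helper. The cut-offs 0.7 and 0.3 come from the text; the function name and the "intermediate" label for values between the two cut-offs are assumptions, since the text does not name that range.

```python
def reliability_label(wnci):
    """Map a WNCI value in [0, 1] to a coarse reliability label."""
    if not 0.0 <= wnci <= 1.0:
        raise ValueError("WNCI is expected to lie between 0 and 1")
    if wnci > 0.7:
        return "high"          # result can be considered highly reliable
    if wnci < 0.3:
        return "low"           # treat with caution or re-examine the sample
    return "intermediate"      # assumption: the text does not name this range

# The output pair: a made-up preliminary prediction in mg/g and its label.
result = (2.4, reliability_label(0.85))
```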
[0063] In one embodiment of the present invention, the calculation of the weighted neighborhood consistency index (WNCI) includes the following steps: in the spectral space of the modeling sample set, the Mahalanobis distance between the standard spectral data and the spectral data of each sample in the modeling sample set is calculated, and a preset number of samples with the smallest distances are selected as neighborhood samples; based on the Mahalanobis distance between each neighborhood sample and the standard spectral data, its Gaussian weight is calculated using a preset kernel width parameter; from the known capsanthin content reference values of the neighborhood samples and their corresponding weights, the weighted standard deviation is calculated; based on the capsanthin content reference values of the neighborhood samples, their value range is divided into a preset number of equal-width intervals, the sum of the weights of the samples falling into each interval is calculated, and the weighted spectral interference entropy (SIE) is calculated using a very small smoothing constant; the weighted standard deviation and the spectral interference entropy (SIE) are then combined to obtain the weighted neighborhood consistency index (WNCI), the calculation formula of which uses the global range of the capsanthin content reference values of all chili pepper samples in the modeling sample set and two preset regularization parameters.
[0064] Specifically, the weighted neighborhood consistency index (WNCI) is calculated from the standard spectral data and the modeling sample set (containing the spectral data matrix and its corresponding vector of known capsanthin content reference values). On this basis, the reliability of the prediction is evaluated by analyzing the neighborhood structure of the pepper sample to be tested in the spectral space.
[0065] First, in the spectral space of the modeling sample set (i.e., the M-dimensional space spanned by the spectral data matrix of the modeling sample set), the Mahalanobis distance between the standardized spectral vector of the sample to be tested and the spectral vector of each modeling sample (i.e., the o-th row of the spectral data matrix) is calculated, which effectively accounts for the correlation between spectral variables. Writing $\mathbf{x}$ for the vector of the sample to be tested and $\mathbf{x}_o$ for that of the o-th modeling sample, the distance is $d_o = \sqrt{(\mathbf{x} - \mathbf{x}_o)^{\mathsf{T}} S^{-1} (\mathbf{x} - \mathbf{x}_o)}$, where S is the M×M covariance matrix of the spectral data matrix of the modeling sample set and $S^{-1}$ is its inverse. The element of S for wavelengths i and j is the covariance of the reflectance values at those two wavelengths, computed about their respective mean values across all chili pepper samples in the modeling sample set. Using the Mahalanobis distance, the samples in the modeling sample set that are most similar to the chili pepper sample to be tested can be found while taking the correlation between spectral bands into account.
[0066] After all Mahalanobis distances have been calculated, the modeling samples with the smallest distances are selected as neighborhood samples, their number being the preset neighborhood size. The neighborhood size should be large enough to form a statistically meaningful local sample set while avoiding the inclusion of irrelevant distant neighbors; its typical value range is 10 to 50 (for example, it can be set to 20), and one setting principle is to make it approximately 1% to 5% of the total number of modeling samples.
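The distance calculation and neighbor selection described above can be sketched with NumPy as follows. This is a minimal illustration, not the applicant's implementation: the function name `mahalanobis_neighbors`, the use of the sample covariance (divisor N−1, NumPy's default), and the use of a pseudo-inverse in place of the plain inverse for numerical safety are all assumptions.

```python
import numpy as np


def mahalanobis_neighbors(x_test, X_model, k=20):
    """Return indices and Mahalanobis distances of the k nearest modeling samples.

    x_test  : standardized spectrum of the sample to be tested, shape (M,)
    X_model : spectral data matrix of the modeling sample set, shape (N, M)
    k       : neighborhood size (20 is the example value in the description)
    """
    X_model = np.asarray(X_model, dtype=float)
    x_test = np.asarray(x_test, dtype=float)

    # M x M covariance of the spectral variables (columns are wavelengths).
    S = np.atleast_2d(np.cov(X_model, rowvar=False))
    # Assumption: a pseudo-inverse is used so that nearly collinear bands
    # do not make the inversion fail; the text specifies the inverse of S.
    S_inv = np.linalg.pinv(S)

    diff = X_model - x_test                            # shape (N, M)
    d2 = np.einsum("ij,jk,ik->i", diff, S_inv, diff)   # squared distances
    d = np.sqrt(np.maximum(d2, 0.0))                   # guard tiny negatives

    k = min(k, d.shape[0])
    idx = np.argsort(d)[:k]                            # k smallest distances
    return idx, d[idx]
```

The returned distances are sorted in ascending order, so the first entry corresponds to the most similar modeling sample.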
[0067] For the v-th selected neighborhood sample (v = 1, 2, …, up to the neighborhood size), its Gaussian weight is calculated from its Mahalanobis distance using a preset kernel width parameter, which controls the rate at which the weights decay with distance. The larger the parameter, the slower the weights decay, meaning that samples further away can also obtain relatively high weights; the smaller the parameter, the faster the weights decay, emphasizing the role of the nearest neighbors. The value of the kernel width parameter is usually related to the squared scale of the distances within the neighborhood; a common heuristic is to take the median of all neighborhood distances and set the parameter from it, with typical values lying between 0 and 10.
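A minimal sketch of the Gaussian weighting step follows. The exact weight formula is not legible in this text, so the common form w = exp(−d² / (2σ²)) is assumed here, with σ defaulting to the median neighborhood distance as one reading of the heuristic mentioned above; the function name `gaussian_weights` is likewise hypothetical.

```python
import numpy as np


def gaussian_weights(distances, sigma=None):
    """Gaussian weights for neighborhood samples from their distances.

    Assumed form: w = exp(-d^2 / (2 * sigma^2)). If sigma is not given,
    the median of the neighborhood distances is used (an assumption based
    on the median heuristic in the description).
    """
    d = np.asarray(distances, dtype=float)
    if sigma is None:
        sigma = float(np.median(d))
    if sigma <= 0:
        raise ValueError("kernel width must be positive")
    return np.exp(-d ** 2 / (2.0 * sigma ** 2))
```

Under this assumed form a larger kernel width gives distant neighbors relatively more weight, which matches the qualitative behavior described in the text.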
[0068] Based on the known capsanthin content reference values of the neighborhood samples (o = 1, 2, …, up to the neighborhood size) and their corresponding Gaussian weights, the weighted mean is calculated first; it reflects the central tendency of the content values within the neighborhood. The weighted standard deviation is then calculated, which measures the dispersion of the content values around the weighted mean within the neighborhood. The smaller the weighted standard deviation, the more concentrated the capsanthin content values of the samples in the neighborhood, the more consistent the prediction environment of the chili pepper sample to be tested, and, in theory, the more reliable the prediction result. The weighted standard deviation has the same dimension as the known capsanthin content reference values (e.g., mg/g).
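The weighted mean and weighted standard deviation can be sketched as below. The weight-normalized (population-style) form of the standard deviation is an assumption, since the original formula is not legible here, and the function name `weighted_mean_std` is hypothetical.

```python
import numpy as np


def weighted_mean_std(y, w):
    """Weighted mean and weighted standard deviation of content values.

    y : reference capsanthin contents of the neighborhood samples
    w : their Gaussian weights (non-negative, not all zero)
    Assumed form: sigma_w = sqrt(sum(w * (y - mu_w)^2) / sum(w)).
    """
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    w_sum = w.sum()
    if w_sum <= 0:
        raise ValueError("weights must have a positive sum")
    mu_w = np.sum(w * y) / w_sum
    sigma_w = np.sqrt(np.sum(w * (y - mu_w) ** 2) / w_sum)
    return float(mu_w), float(sigma_w)
```

Both returned values carry the unit of the reference values, for example mg/g.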
[0069] To assess the disorder or orderliness of the distribution of the neighborhood sample content values, the spectral interference entropy (SIE) is calculated. First, the value range of the capsanthin content reference values of the neighborhood samples (i.e., from their minimum to their maximum) is divided into a preset number of equal-width intervals. This preset number of intervals is used to discretize the distribution; its typical value range is 5 to 15, and it can, for example, be taken as 10. The interval width is the value range divided by the number of intervals.
[0070] For the l-th interval (l = 1, 2, …, up to the number of intervals), the sum of the weights of all neighborhood samples falling into it is calculated. These interval weight sums are then normalized by their total to obtain the relative weight of each interval; the relative weights therefore sum to 1.
[0071] The spectral interference entropy (SIE) is calculated from this normalized weight distribution according to the information entropy formula, in which a very small positive constant, called the smoothing constant, is added to prevent the logarithm from being undefined when the relative weight of an interval is 0; its typical value is very small. The SIE measures the uniformity of the distribution of content values among the neighborhood samples: if all content values are concentrated in very few intervals (orderly distribution), the SIE value is low (close to 0); if the content values are uniformly distributed across all intervals (chaotic distribution), the SIE value is high (its theoretical maximum is approximately the logarithm of the number of intervals). A lower SIE value means that the content variation pattern within the neighborhood is simpler and the prediction is more reliable.
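The binning, normalization, and entropy steps can be sketched as follows. The natural logarithm, the smoothing constant 1e-12, and the handling of a zero-width value range (all samples placed in one interval) are assumptions, and the function name `spectral_interference_entropy` is hypothetical.

```python
import numpy as np


def spectral_interference_entropy(y, w, n_bins=10, eps=1e-12):
    """Weighted entropy of neighborhood content values over equal-width bins.

    y      : reference capsanthin contents of the neighborhood samples
    w      : their Gaussian weights
    n_bins : number of equal-width intervals (10 is the example value)
    eps    : smoothing constant inside the logarithm (assumed value)
    """
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    y_min, y_max = y.min(), y.max()

    if y_max == y_min:
        # Degenerate range: every sample falls into the first interval.
        bin_idx = np.zeros(y.shape[0], dtype=int)
    else:
        width = (y_max - y_min) / n_bins
        # The maximum value is clipped into the last interval.
        bin_idx = np.minimum(((y - y_min) / width).astype(int), n_bins - 1)

    interval_w = np.bincount(bin_idx, weights=w, minlength=n_bins)
    p = interval_w / interval_w.sum()          # relative weights, sum to 1
    return float(-np.sum(p * np.log(p + eps)))
```

With equal weights spread evenly over all ten intervals the result approaches ln(10), and with all values in a single interval it approaches 0, matching the two limiting cases described above.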
[0072] Finally, the weighted standard deviation and the spectral interference entropy (SIE) are combined to calculate the weighted neighborhood consistency index (WNCI). In the WNCI formula, the global range of the capsanthin content reference values in the modeling sample set is used to normalize the weighted standard deviation, eliminating the influence of the overall content range. The two preset regularization parameters are both positive numbers and adjust the relative contributions of the weighted standard deviation term and the spectral interference entropy term in the final exponent. The first parameter controls the contribution of the dispersion term and typically ranges from 0.5 to 3.0; for example, a value of 1.0 means that the normalized dispersion is used directly, and increasing the parameter reduces the impact of the dispersion term. The second parameter controls the contribution of the orderliness term and typically ranges from 0.2 to 2.0; for example, it can be taken as 0.5, and increasing it enhances the effect of the orderliness term. In one specific embodiment, the first parameter is set to 1.0 and the second to 0.5.
[0073] The exponential function exp is used to map the non-negative penalty term to the (0,1] interval. When the content values within the neighborhood are highly concentrated (the weighted standard deviation approaches 0) and their distribution is very orderly (SIE→0), the penalty term approaches 0 and WNCI→1, indicating the highest reliability. Conversely, when the dispersion is large or the distribution is very chaotic, the penalty term increases and the WNCI value decreases toward 0, indicating low reliability.
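The final combination can be sketched as below. The exact WNCI formula is not legible in this text; the form WNCI = exp(−(σ_w / (α·R) + β·SIE)) is an assumption reconstructed from the prose (a first parameter α that reduces the dispersion term as it grows, a second parameter β that strengthens the entropy term as it grows, and a global range R that normalizes the weighted standard deviation). The function name `wnci` and its argument names are hypothetical.

```python
import math


def wnci(sigma_w, sie, global_range, alpha=1.0, beta=0.5):
    """Weighted neighborhood consistency index in (0, 1].

    Assumed form: exp(-(sigma_w / (alpha * global_range) + beta * sie)).
    sigma_w      : weighted standard deviation of neighborhood contents
    sie          : spectral interference entropy of the neighborhood
    global_range : max - min of reference contents in the modeling set
    alpha, beta  : positive regularization parameters (1.0 and 0.5 are the
                   values given for one specific embodiment)
    """
    if global_range <= 0 or alpha <= 0 or beta <= 0:
        raise ValueError("global_range, alpha and beta must be positive")
    # Clamp a slightly negative entropy (from smoothing) to zero so that
    # the penalty stays non-negative and the result stays within (0, 1].
    penalty = max(sigma_w, 0.0) / (alpha * global_range) + beta * max(sie, 0.0)
    return math.exp(-penalty)
```

Under this assumed form, a tight and orderly neighborhood (both terms near zero) yields a value near 1, while a dispersed or disordered neighborhood drives the value toward 0.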
[0074] The above embodiments are only used to illustrate the technical methods of the present invention and are not intended to limit it. Although the present invention has been described in detail with reference to preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions can be made to the technical methods of the present invention without departing from the spirit and scope of the technical methods of the present invention.
Claims
1. A rapid detection method for capsanthin content based on multispectral data, characterized in that it includes the following steps: S1: Collect spectral reflectance data of chili pepper samples in the visible and near-infrared bands using a multispectral imaging device to construct the original spectral dataset; S2: The original spectral dataset is standardized, and outliers are then removed using a two-step method based on multivariate statistics and model residual analysis to obtain a sample set for modeling; S3: The modeling sample set is divided into multiple training and test sets using cross-validation; multiple machine learning regression models are constructed based on each training set, and the preferred prediction model for detecting capsanthin content is selected by comparing the prediction performance of each model on the corresponding test set; S4: For the chili pepper sample to be tested, its spectral reflectance data are collected by a multispectral imaging device, and the collected spectral reflectance data are standardized and subjected to outlier removal to obtain standard spectral data. The standard spectral data are then input into the preferred prediction model to obtain the quantitative prediction result of the capsanthin content of the chili pepper sample to be tested.
2. The rapid detection method for capsanthin content based on multispectral data according to claim 1, characterized in that S1 includes the following steps: A hyperspectral imager is used as the multispectral imaging device; before collecting spectral reflectance data from the chili pepper samples, the device is radiometrically calibrated using a standard white reference board and a black reference board to establish the conversion relationship between the original measurement signal and the absolute reflectance; based on this conversion relationship, the multispectral imaging device is used to repeatedly collect spectral reflectance data from each chili pepper sample; the average of the spectral reflectance data collected multiple times for each chili pepper sample is calculated to obtain the reflectance data of each chili pepper sample at multiple wavelengths; the reflectance data of all chili pepper samples are assembled to form the original spectral dataset.
3. The rapid detection method for capsanthin content based on multispectral data according to claim 1, characterized in that, S2 includes the following steps: The reflectance data of each chili pepper sample in the original spectral dataset are standardized so that the mean of the data for each wavelength is 0 and the standard deviation is 1. Based on the standardized reflectance data, the first step of outlier removal based on multivariate statistical methods is performed by calculating the multivariate statistical outlier index for each chili pepper sample. Based on the chili sample data retained after the first step of outlier removal and the known reference values of capsanthin content, an initial prediction model is constructed. The residual is obtained based on the difference between the predicted value calculated by the initial prediction model and the actual reference value. The second step of outlier removal based on the model residual analysis method is then performed based on the residual. Finally, the remaining chili pepper samples after the first and second outlier removal steps constitute the sample set for modeling.
4. The rapid detection method for capsanthin content based on multispectral data according to claim 3, characterized in that the first step of outlier removal based on multivariate statistical methods includes the following steps: The interspectral correlation entropy (ISCE) is used as the multivariate statistical outlier index. Using the standardized reflectance data, the ISCE of each chili pepper sample is calculated from the Pearson correlation coefficients between wavelength points i and j across all chili pepper samples, where c is the chili pepper sample index and M is the number of wavelengths; the formula also involves a sum term and a very small constant. Chili pepper samples with an interspectral correlation entropy (ISCE) value higher than a preset threshold are removed.
5. The rapid detection method for capsanthin content based on multispectral data according to claim 3, characterized in that the second step of outlier removal based on model residual analysis includes the following steps: Based on the chili pepper sample data retained after the first step of outlier removal using multivariate statistical methods and their known capsanthin content reference values, a partial least squares regression model is constructed. This model predicts the capsanthin content of each chili pepper sample, and the predicted value is compared with the actual reference value to calculate the standardized prediction residual of each chili pepper sample; meanwhile, the Isolation Forest algorithm is used to perform unsupervised anomaly detection on the reflectance data of the chili pepper samples retained after the first step of outlier removal, and the standardized outlier score of each chili pepper sample is obtained; based on the standardized prediction residual and the standardized outlier score of each chili pepper sample, its comprehensive modeling interference score is calculated, using scaling correction parameters for the distributions of the standardized prediction residuals and the standardized outlier scores, respectively, and a preset cross-term coupling coefficient; chili pepper samples whose comprehensive modeling interference score exceeds a preset unified threshold are identified as interference samples for modeling and are removed.
6. The rapid detection method for capsanthin content based on multispectral data according to claim 1, characterized in that, S3 includes the following steps: The five-fold cross-validation method is used to divide the modeling sample set into five mutually exclusive and equally distributed subsets. In a rotating manner, one subset is used as the test set in turn, and the other four subsets are combined as the training set, thus forming five different combinations of training and test sets. For each of the five training sets, four models were constructed: multiple linear regression model, partial least squares regression model, random forest regression model, and convolutional neural network regression model. Using the test set corresponding to each training set, the four regression models built on that training set were evaluated, and the average performance index of each regression model on the five test sets was obtained. By comparing the average performance index of the four regression models, the regression model type with the best average prediction performance was selected as the preferred prediction model for detecting capsanthin content.
7. The rapid detection method for capsanthin content based on multispectral data according to claim 1, characterized in that S4 includes the following steps: For the chili pepper sample to be tested, its spectral reflectance data are acquired using a multispectral imaging device. The acquired spectral reflectance data are then standardized and subjected to outlier removal to obtain standard spectral data. The standard spectral data are then input into the preferred prediction model to obtain a preliminary predicted value. Based on the standard spectral data and the modeling sample set, a neighborhood reliability assessment is performed, and a weighted neighborhood consistency index (WNCI) is calculated. Finally, the quantitative prediction result of the capsanthin content of the chili pepper sample to be tested, comprising the preliminary predicted value and the weighted neighborhood consistency index (WNCI), is obtained.
8. The rapid detection method for capsanthin content based on multispectral data according to claim 7, characterized in that the calculation of the weighted neighborhood consistency index (WNCI) includes the following steps: in the spectral space of the modeling sample set, the Mahalanobis distance between the standard spectral data and the spectral data of each sample in the modeling sample set is calculated, and a preset number of samples with the smallest distances are selected as neighborhood samples; based on the Mahalanobis distance between each neighborhood sample and the standard spectral data, its Gaussian weight is calculated using a preset kernel width parameter; from the known capsanthin content reference values of the neighborhood samples and their corresponding weights, the weighted standard deviation is calculated; based on the capsanthin content reference values of the neighborhood samples, their value range is divided into a preset number of equal-width intervals, the sum of the weights of the samples falling into each interval is calculated, and the weighted spectral interference entropy (SIE) is calculated using a very small smoothing constant; the weighted standard deviation and the spectral interference entropy (SIE) are then combined to obtain the weighted neighborhood consistency index (WNCI), the calculation formula of which uses the global range of the capsanthin content reference values of all chili pepper samples in the modeling sample set and two preset regularization parameters.