Fertilizer nutrient online rapid detection method and system based on near infrared spectrum
By iteratively screening wavelength variables and optimizing cross-validation, the high dimensionality and uneven sample distribution problems of fertilizer nutrient detection in near-infrared spectroscopy were solved, achieving high accuracy and stability in fertilizer nutrient detection and simplifying the model structure.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- HENAN XINLIANXIN FERTILIZER TESTING CO LTD
- Filing Date
- 2026-03-24
- Publication Date
- 2026-06-30
AI Technical Summary
Existing near-infrared spectroscopy technology for fertilizer nutrient detection suffers from high dimensionality, multicollinearity, redundant variables, and noise interference, leading to increased model complexity, decreased prediction accuracy and generalization ability. The random sampling method of k-fold cross-validation results in an unbalanced sample distribution, affecting the reliability and accuracy of the model.
A method of multiple iterations to screen wavelength variables was adopted, combined with an exponential decay function and optimized k-fold cross-validation. Through two-dimensional spatial mapping of leverage value and Mahalanobis distance, the representativeness of sample distribution was ensured, redundant variables were eliminated, and a quantitative prediction model was established.
It improved the prediction accuracy and stability of the model, enhanced the reliability and generalization ability of fertilizer nutrient detection, and simplified the model complexity.
Figure CN122306745A_ABST
Description
Technical Field
[0001] This application belongs to the field of detection, and in particular relates to an online rapid detection method and system for fertilizer nutrients based on near-infrared spectroscopy.
Background Technology
[0002] Near-infrared spectroscopy (NIRS) collects spectral information from fertilizer samples and combines it with chemometric methods to construct quantitative models, enabling rapid detection and prediction of fertilizer nutrient content. However, NIRS data typically contain hundreds or even thousands of wavelength variables and exhibit high dimensionality, multicollinearity, redundant variables, and noise interference. Modeling with full-spectrum data increases model complexity, introduces interfering information, and reduces predictive accuracy. Variable selection methods, such as the Successive Projections Algorithm (SPA), Competitive Adaptive Reweighted Sampling (CARS), and Uninformative Variable Elimination (UVE), employ different strategies to remove irrelevant variables and extract characteristic spectra relevant to the target components. In these variable selection processes, k-fold cross-validation is usually used to select the optimal wavelength subset and evaluate the modeling effect. However, k-fold cross-validation divides the dataset by random sampling, which cannot guarantee that the samples in each data fold are balanced and representative. When the calibration set contains outlier samples and highly influential points, random partitioning tends to leave some folds with clustered or unbalanced samples. The resulting root mean square error of cross-validation (RMSECV) is then biased and unstable and does not reflect the true predictive ability of the model. This undermines the reliability of the optimal feature wavelength selection and limits the accuracy and generalization ability of the quantitative prediction model.
Summary of the Invention
[0003] This invention provides a method and system for rapid online detection of fertilizer nutrients based on near-infrared spectroscopy.
[0004] According to one aspect of the present invention, a method for rapid online detection of fertilizer nutrients based on near-infrared spectroscopy is provided, the method comprising the following steps: Online near-infrared spectral data of fertilizer samples to be tested are acquired, a calibration set containing spectral data and corresponding nutrient concentrations is established, and the spectral data is preprocessed. The process involves multiple iterations to screen wavelength variables. Each iteration includes: randomly sampling from the calibration set to construct a sample subset, and establishing a partial least squares regression model based on the subset; extracting the absolute values of the regression coefficients of each wavelength variable in the regression model as importance weights; calculating the number of wavelength variables to be retained in the current iteration using an exponential decay function, and retaining variables with high weights according to the importance weights to obtain a subset of candidate wavelength variables; and calculating the root mean square error of the cross-validation of the subset of candidate wavelength variables using an optimized k-fold cross-validation method. The optimization method includes: calculating the leverage value and Mahalanobis distance of all samples in the calibration set, mapping the samples to a two-dimensional space composed of two parameters; dividing the space into grids, and using the grids as strata, all samples are allocated to k data folds through stratified random sampling to ensure the representativeness of the distribution of data in each fold. Based on the root mean square error of cross-validation calculated in all iterations, the subset of candidate wavelength variables with the smallest error is selected as the optimal feature spectrum. A quantitative prediction model is established based on the optimal characteristic spectrum and the corresponding nutrient concentration.
[0005] According to another aspect of the present invention, an online rapid detection system for fertilizer nutrients based on near-infrared spectroscopy is provided, the system comprising the following modules: The preprocessing module is used to acquire online near-infrared spectral data of fertilizer samples to be tested, establish a calibration set containing spectral data and corresponding nutrient concentrations, and preprocess the spectral data. The screening module is used to perform multiple iterations to screen wavelength variables. Each iteration includes: randomly sampling from the calibration set to construct a sample subset, and establishing a partial least squares regression model based on the subset; extracting the absolute values of the regression coefficients of each wavelength variable in the regression model as importance weights; calculating the number of wavelength variables to be retained in the current iteration using an exponential decay function, and retaining variables with high weights according to the importance weights to obtain a subset of candidate wavelength variables; calculating the root mean square error of the cross-validation of the subset of candidate wavelength variables using an optimized k-fold cross-validation method; the optimization method includes: calculating the leverage value and Mahalanobis distance of all samples in the calibration set, mapping the samples to a two-dimensional space composed of two parameters; dividing the space into grids, and using the grids as layers, all samples are allocated to k data folds through stratified random sampling to ensure the representativeness of the distribution of data in each fold; The selection module is used to select the subset of candidate wavelength variables with the smallest error as the optimal feature spectrum based on the root mean square error of cross-validation calculated in all iterations. 
The establishment module is used to build a quantitative prediction model based on the optimal characteristic spectrum and the corresponding nutrient concentration.
[0006] By combining an iterative screening strategy with an exponential decay function, this invention eliminates redundant and irrelevant variables in the spectral data and simplifies the model. Using an optimized cross-validation method based on leverage and Mahalanobis distance, it ensures the balanced distribution and representativeness of each data fold through two-dimensional spatial mapping and grid-based stratified sampling, making the performance evaluation of candidate wavelength combinations more objective and reliable. The quantitative prediction model established from the selected optimal feature spectrum improves prediction accuracy, stability, and generalization ability, providing reliable technical support for online detection of fertilizer nutrients.
Attached Figure Description
[0007] Figure 1 is a flowchart of a method for rapid online detection of fertilizer nutrients based on near-infrared spectroscopy; Figure 2 is a scatter plot of the Mahalanobis distance-leverage value sample distribution; Figure 3 is a schematic diagram for determining the optimal number of principal components.
Detailed Implementation
[0008] The features and exemplary embodiments of various aspects of this application will be described in detail below. To make the objectives, technical solutions, and advantages of this application clearer, the application will be further described in detail below with reference to the accompanying drawings and specific embodiments. It should be understood that the specific embodiments described herein are only intended to explain this application and not to limit it. For those skilled in the art, this application can be implemented without some of these specific details. The following description of the embodiments is merely to provide a better understanding of this application by illustrating examples.
[0009] It should be noted that, in this document, relational terms such as "first" and "second" are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that includes said element.
[0010] As shown in Figure 1, a rapid online detection method for fertilizer nutrients based on near-infrared spectroscopy includes the following steps: S1. Acquire online near-infrared spectral data of the fertilizer sample to be tested, establish a calibration set containing spectral data and corresponding nutrient concentrations, and preprocess the spectral data. Spectra are acquired online using a Fourier transform near-infrared spectrometer, typically over the range 900 to 2500 nanometers. Nitrogen, phosphorus, and potassium concentrations are determined by chemical analysis methods such as the Kjeldahl method, molybdenum-antimony spectrophotometry, and flame photometry, and serve as reference values for constructing the calibration set matrix. The standard normal variate (SNV) transformation is applied to eliminate the effects of particle size and surface scattering. The Savitzky-Golay algorithm is then used to smooth and denoise the spectra, and the first derivative of the spectrum is calculated to remove baseline drift.
[0011] In some embodiments of the present invention, the preprocessing of the spectral data includes: The effects of sample surface scattering and optical path variation are eliminated by using the standard normal variate (SNV) transformation; The transformed spectrum is processed using the Savitzky-Golay smoothing derivative method.
[0012] For each sample in the calibration set, the near-infrared spectrum is treated as a vector, for example a 1×401 row vector containing absorbance values at 401 wavelengths, and the mean and standard deviation of all absorbance values within the vector are calculated. The mean is subtracted from each absorbance value, and the result is divided by the standard deviation. This centering and scaling is applied to each spectrum to correct for baseline drift and optical path variations caused by uneven fertilizer particle size and compaction density. Savitzky-Golay (SG) smoothing and differentiation are then applied: the SNV-processed spectrum is convolved with a sliding window 11 wavelength data points wide. Within each window, a second-order polynomial is fitted to the 11 points by least squares, and the second derivative of this polynomial at the center of the window is taken as the new spectral value for that point. The selected parameters (smoothing window 11, polynomial order 2, and differentiation order 2) are a commonly used empirical combination in chemometrics that balances noise filtering against the extraction of useful information. This process not only filters out high-frequency noise in the spectrum but also separates overlapping peaks and removes baseline effects through second-order differentiation, thereby highlighting subtle spectral features related to nutrient concentration.
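The SNV and Savitzky-Golay steps described in this paragraph can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation: the function names `snv` and `sg_second_derivative` are hypothetical, the sample standard deviation (n - 1 denominator) is assumed because the text does not say which form is used, the derivative is taken with respect to the data-point index, and points without a full window are simply dropped.

```python
import statistics

def snv(spectrum):
    """Standard normal variate: centre and scale one spectrum by its own statistics."""
    mean = statistics.fmean(spectrum)
    std = statistics.stdev(spectrum)  # assumption: sample standard deviation (n - 1)
    return [(value - mean) / std for value in spectrum]

def sg_second_derivative(spectrum, half_width=5):
    """Second derivative from a least-squares quadratic fit in a sliding window.

    For offsets j = -m..m, the quadratic coefficient of the fit is
    sum((j*j - c) * y_j) / sum((j*j - c)**2), where c is the mean of j*j,
    and the second derivative at the window centre is twice that coefficient.
    Only points with a full window are returned (assumption: edges are dropped).
    """
    offsets = range(-half_width, half_width + 1)
    c = sum(j * j for j in offsets) / (2 * half_width + 1)
    basis = [j * j - c for j in offsets]
    norm = sum(b * b for b in basis)
    weights = [2.0 * b / norm for b in basis]
    width = len(weights)
    return [
        sum(w * y for w, y in zip(weights, spectrum[start:start + width]))
        for start in range(len(spectrum) - width + 1)
    ]
```

With the default `half_width=5` the window holds 11 points, matching the window width, polynomial order, and derivative order named in the text.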
[0013] S2, perform multiple iterations to screen wavelength variables. Each iteration includes: randomly sampling from the calibration set to construct a sample subset, and establishing a partial least squares regression model based on the subset; Using the Monte Carlo sampling approach, samples are randomly drawn without replacement from the calibration set at a predetermined ratio, such as 80%, to form a subset of samples for the current iteration. Partial least squares regression is then applied to this subset, and through internal cross-validation, such as 7-fold cross-validation, the optimal number of latent variables is determined, thereby establishing a temporary partial least squares regression model.
[0014] In some embodiments of the present invention, constructing a sample subset by randomly sampling from the calibration set includes: Set the total number of iterations; In each iteration, 80% of the samples are randomly drawn from the calibration set without replacement as a subset of samples used to build the partial least squares regression model.
[0015] The partial least squares regression model is a linear network model that extracts a series of latent variables, i.e., principal components, by decomposing the input spectral matrix X and the output concentration matrix Y. A linear relationship is established between the latent variable score matrix T of X and the latent variable score matrix U of Y. Based on this internal relationship, a global linear regression equation from the original input to the output is constructed: $Y = XB + E$, where B is the calculated regression coefficient vector and E is the residual matrix.
[0016] The network input is a spectral data matrix X of a subset of samples. Each row of this matrix represents the spectrum of a fertilizer sample, and each column represents the absorbance value at a specific wavelength. The network output is the predicted nutrient concentration vector. Each element in this vector corresponds to a predicted concentration value for an input sample.
[0017] The total number of iterations is set to 1000. Assume the calibration set contains M fertilizer samples, for example, M=200. In each iteration, from the 1st to the 1000th, 160 samples (80% of the 200) are first drawn at random without replacement from these 200 samples to construct a temporary subset. The spectral data and corresponding nutrient concentration values of this subset are used to build a temporary partial least squares regression model. Because a different random subset is used in each of the 1000 iterations, the importance of each wavelength variable is evaluated under different data distributions, which avoids selection bias caused by a few unusual samples and gives the selected characteristic spectra better generalization ability.
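The per-iteration draw described here (80% of the calibration samples, without replacement) can be sketched as follows. The helper name `draw_calibration_subset` and the optional `rng` argument are illustrative assumptions; the text does not specify a random number generator or seeding policy.

```python
import random

def draw_calibration_subset(n_samples, fraction=0.8, rng=None):
    """Return sorted indices of a random subset drawn without replacement."""
    rng = rng or random.Random()
    size = round(n_samples * fraction)
    # random.sample never repeats an element, so this is sampling without replacement.
    return sorted(rng.sample(range(n_samples), size))

# Example with the figures from the text: 200 calibration samples, 160 drawn.
subset = draw_calibration_subset(200, rng=random.Random(0))
```

Passing a seeded `random.Random` makes a run reproducible; calling the helper once per iteration gives a fresh subset each time.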
[0018] S3, extract the absolute value of the regression coefficient of each wavelength variable in the regression model as the importance weight; From the partial least squares regression model established in the previous step, obtain the regression coefficient vector, where each element corresponds to a wavelength variable. Take the absolute value of each regression coefficient value in this vector to obtain the importance weight value of each wavelength variable.
[0019] S4. The number of wavelength variables to be retained in the current iteration is calculated using the exponential decay function, and variables with high weights are retained according to the importance weights to obtain a subset of candidate wavelength variables; Set the initial total number of wavelength variables, the expected number of wavelength variables, and the decay rate. Calculate the number of wavelengths to be retained in this iteration using an exponential decay function. Sort all wavelength variables in descending order of their weight values obtained in the previous step, and select a specified number of wavelengths with the highest weights to form a subset of candidate wavelength variables.
[0020] In some embodiments of the present invention, the step of calculating the number of wavelength variables to be retained in the current iteration using an exponential decay function includes: The number of wavelength variables to be retained in the i-th iteration is obtained by rounding the result of the formula a×exp(-k×i), where a is the total number of initial wavelength variables, k is the attenuation rate coefficient, and i is the current iteration number; and that number of variables with the highest importance weights is retained.
[0021] An exponential decay function is employed in the iterative process so that the search progressively reduces the number of candidate wavelength variables, moving from coarse to fine selection. The parameters in the formula are set as follows: 'a' is the initial total number of wavelength variables in the preprocessed spectrum, for example, 401; 'i' is the current iteration number, increasing from 1 to 1000; the decay rate coefficient 'k' is a key parameter, preferably set to 0.005, that controls the speed of variable elimination so that the number of variables decreases smoothly and sufficiently. In the i-th iteration, based on the PLS model established from the current sample subset, the regression coefficients of all 'a' wavelength variables are extracted, and their absolute values are taken as importance weights. The formula a×exp(-k×i) gives the number of variables to be retained in this iteration. For example, at the 100th iteration, 401×exp(-0.5) ≈ 243; at the 500th iteration, 401×exp(-2.5) ≈ 33. The variables are sorted by weight from highest to lowest, and only that number of the most important wavelength variables is retained, giving the subset of candidate wavelengths for this iteration.
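The retention count and the weight-based selection can be checked with a short sketch. The function names `retained_count` and `select_top_wavelengths` are hypothetical, and Python's built-in `round` (which rounds exact halves to the nearest even integer) is assumed as the rounding rule, since the text only says the result is rounded.

```python
import math

def retained_count(a, k, i):
    """Number of wavelength variables kept at iteration i: round(a * exp(-k * i))."""
    return round(a * math.exp(-k * i))

def select_top_wavelengths(coefficients, n_keep):
    """Indices of the n_keep variables with the largest absolute regression coefficient."""
    weights = [abs(c) for c in coefficients]
    order = sorted(range(len(weights)), key=lambda j: weights[j], reverse=True)
    return sorted(order[:n_keep])

# Values from the text: a = 401 variables, k = 0.005.
kept_at_100 = retained_count(401, 0.005, 100)
kept_at_500 = retained_count(401, 0.005, 500)
```

With these parameters the sketch reproduces the figures quoted in the text for the 100th and 500th iterations.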
[0022] S5, for the subset of candidate wavelength variables, the root mean square error of cross-validation is calculated using an optimized k-fold cross-validation method; the optimization method includes: calculating the leverage value and Mahalanobis distance of all samples in the calibration set, mapping the samples to a two-dimensional space composed of the two parameters; dividing the space into grids, and using the grids as layers, all samples are allocated to k data folds through stratified random sampling to ensure the representativeness of the distribution of data in each fold; Principal component analysis (PCA) is performed on the spectral data of the entire calibration set to extract the principal component scores of all samples. The leverage value and Mahalanobis distance of each sample are calculated from the score matrix. All sample points are mapped to a two-dimensional space with the leverage value on the x-axis and the Mahalanobis distance on the y-axis, as shown in Figure 2. In an alternative embodiment, the Mahalanobis distance of the sample spectrum principal components is used as the horizontal axis and the sample nutrient concentration as the vertical axis. The two-dimensional space is divided into equal grid cells, for example a 5×5 grid. The samples in each grid cell are treated as a layer, and stratified random sampling is used to distribute the samples of each layer randomly and evenly among a preset number K of data folds, for example 10 folds. K-fold cross-validation is then performed using only the candidate wavelength variable subset selected in the previous step: each of the K data folds in turn serves as the validation set while the remaining K-1 folds serve as the training set to build a partial least squares model, and the prediction error on the validation set is calculated. The prediction results of all folds are pooled, and the root mean square error of cross-validation (RMSECV) for the candidate subset is calculated using the root mean square error formula.
[0023] For the i-th sample $x_i$ in the calibration set spectral matrix X, the sample leverage value $h_i$ is calculated by the formula $h_i = x_i (X^{T}X)^{-1} x_i^{T}$; the sample Mahalanobis distance $d_i$ is calculated by the formula $d_i = \sqrt{(x_i - \mu) S^{-1} (x_i - \mu)^{T}}$, where $\mu$ is the mean vector of all sample spectra, S is the covariance matrix, and X is the calibration set spectral matrix.
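As an illustration of the two sample diagnostics, the sketch below assumes the standard definitions: leverage as x (X^T X)^-1 x^T on the matrix as given, and Mahalanobis distance as the square root of (x - mu) S^-1 (x - mu)^T with the sample covariance (n - 1 denominator). These definitions, the function names, and the small Gauss-Jordan solver are assumptions made for a dependency-free example; in practice the rows would be PCA scores or a matrix with more samples than columns so that both matrices are invertible.

```python
import math

def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]

def gram(rows):
    """Return rows^T @ rows for a list of equal-length row vectors."""
    p = len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def leverage_values(x):
    """Leverage of each row: x_i (X^T X)^-1 x_i^T."""
    xtx = gram(x)
    return [dot(row, solve(xtx, row)) for row in x]

def mahalanobis_distances(x):
    """Mahalanobis distance of each row from the mean of all rows."""
    n, p = len(x), len(x[0])
    mu = [sum(row[j] for row in x) / n for j in range(p)]
    centered = [[row[j] - mu[j] for j in range(p)] for row in x]
    cov = [[v / (n - 1) for v in r] for r in gram(centered)]
    return [math.sqrt(max(0.0, dot(c, solve(cov, c)))) for c in centered]
```

A quick sanity check on any full-rank input: the leverages sum to the number of columns, and the squared distances sum to (n - 1) times the number of columns.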
[0024] With the leverage value on the horizontal axis and the Mahalanobis distance on the vertical axis, determine the maximum and minimum values of all sample points on both axes. The horizontal and vertical axes are each divided into 10 equal intervals, thus dividing the entire two-dimensional space into 100 equal-sized, non-overlapping rectangular grids.
[0025] Treat all non-empty grids containing samples as layers; Within each layer, the unassigned samples are randomly sorted and assigned one at a time to the first fold, the second fold, and so on up to the k-th fold in a cyclical manner; the cycle then restarts from the first fold until all samples in the layer have been assigned. Repeat this process for all layers to obtain k cross-validation folds, where k is the number of cross-validation folds.
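The grid division and cyclic allocation can be sketched as below. This is an illustrative reading of the procedure rather than a definitive implementation: the function name `grid_stratified_folds`, the `seed` parameter, and the handling of a zero-width axis range are assumptions.

```python
import random
from collections import defaultdict

def grid_stratified_folds(leverage, distance, k=10, bins=10, seed=0):
    """Assign each sample a fold index 0..k-1 using grid cells as layers."""
    rng = random.Random(seed)

    def bin_index(value, low, high):
        if high == low:  # assumption: a degenerate axis collapses to one bin
            return 0
        # The maximum value is clamped into the last bin.
        return min(int((value - low) / (high - low) * bins), bins - 1)

    h_low, h_high = min(leverage), max(leverage)
    d_low, d_high = min(distance), max(distance)

    layers = defaultdict(list)
    for idx, (h, d) in enumerate(zip(leverage, distance)):
        cell = (bin_index(h, h_low, h_high), bin_index(d, d_low, d_high))
        layers[cell].append(idx)

    folds = [0] * len(leverage)
    for cell in sorted(layers):  # only non-empty cells exist in the dict
        members = layers[cell]
        rng.shuffle(members)  # random order within the layer
        for position, idx in enumerate(members):
            folds[idx] = position % k  # cycle through fold 0, 1, ..., k-1
    return folds
```

Because each layer restarts at the first fold, as the text describes, lower-numbered folds can end up somewhat larger when many grid cells hold fewer than k samples.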
[0026] To construct a stable and representative cross-validation scheme, a one-time sample diagnosis and partitioning is performed before the 1000 screening iterations begin. Let the full-spectrum data of the calibration set constitute an M×P matrix X, where M is the number of samples and P is the initial total number of wavelengths. Based on the complete matrix X, the leverage value and Mahalanobis distance of each sample are calculated.
[0027] All sample points are mapped to the two-dimensional space, and the range of leverage values and the range of Mahalanobis distances are determined. The two ranges are each uniformly divided into 10 intervals, forming 100 rectangular grids. The cross-validation fold number K=10 is set, and each non-empty grid containing samples is considered a layer. Within each layer, samples are randomly and uniformly distributed into 10 data folds using a cyclic allocation method. This process generates a fixed 10-fold data partitioning scheme that is balanced in terms of sample distribution characteristics. Optionally, M is greater than P. If the sample size is small, sample augmentation can be performed, or principal component analysis can be used to reduce the dimensionality of the spectrum to satisfy the above calculations.
[0028] In the subsequent 1000 iterations of the selection process, whenever a subset of candidate wavelength variables is generated, a pre-defined, fixed 10-fold scheme is used to calculate the root mean square error (RMSECV) of the cross-validation. For the j-th fold (j ranges from 1 to 10), this fold is used as the validation set, and the remaining 9 folds are used as the training set. A PLS model is built based on the current candidate wavelength variable subset, and the nutrient concentration of the validation set is predicted, and the RMSECV is calculated. This process is repeated 10 times, each time using a different fold as the validation set. The RMSECV calculated in the 10 iterations is averaged to obtain the RMSECV of the candidate wavelength subset. This strategy of separating data partitioning from iterative evaluation not only ensures that the evaluation criteria for all candidate variable subsets are uniform and fair, but also improves computational efficiency and avoids repeatedly performing costly matrix inversion and sample partitioning operations in each iteration.
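The fixed-fold evaluation described in this paragraph can be sketched independently of the regression model. In the sketch below, `fit_predict` is a hypothetical callable standing in for "build a PLS model on the training folds and predict the validation fold"; its signature and the function name `rmsecv_fixed_folds` are assumptions, and the per-fold errors are averaged as this paragraph states.

```python
import math

def rmsecv_fixed_folds(x, y, folds, fit_predict):
    """Average per-fold RMSE over a pre-computed fold assignment.

    fit_predict(x_train, y_train, x_valid) must return one prediction
    per validation row (hypothetical interface for any regression model).
    """
    fold_rmse = []
    for fold in sorted(set(folds)):
        train = [i for i, f in enumerate(folds) if f != fold]
        valid = [i for i, f in enumerate(folds) if f == fold]
        predictions = fit_predict(
            [x[i] for i in train], [y[i] for i in train], [x[i] for i in valid]
        )
        squared = [(p - y[i]) ** 2 for p, i in zip(predictions, valid)]
        fold_rmse.append(math.sqrt(sum(squared) / len(squared)))
    return sum(fold_rmse) / len(fold_rmse)
```

Since `folds` is computed once and reused, every candidate wavelength subset is scored on exactly the same partition; only the columns of `x` change between calls.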
[0029] S6. Based on the root mean square error of cross-validation calculated in all iterations, select the subset of candidate wavelength variables with the smallest error as the optimal feature spectrum. Perform a predetermined number of iterative screenings within a loop, for example, 100 times, recording the subset of candidate wavelength variables generated in each iteration and the corresponding root mean square error (RMSECV) of cross-validation. After the iterations are complete, compare all recorded RMSECV values and find the global minimum. The subset of candidate wavelength variables corresponding to this minimum value is determined as the optimal characteristic spectrum.
[0030] S7. Based on the optimal characteristic spectrum and the corresponding nutrient concentration, establish a quantitative prediction model.
[0031] Using the optimal feature spectrum determined in the previous step, the optimal wavelength column is extracted from the spectral data matrix of the original calibration set to obtain a new, reduced-dimensional spectral data matrix. This optimized spectral matrix and the corresponding nutrient concentration reference value are used as input, and the partial least squares regression algorithm is applied again. The optimal number of latent variables is then determined again using K-fold cross-validation on the entire calibration set, establishing a quantitative prediction model for fertilizer nutrients. The optimal feature spectrum of the fertilizer sample to be tested is substituted into the model, and the online detection results are output. Specifically, the online near-infrared spectrum of the unknown fertilizer sample to be tested is acquired and processed using the same preprocessing method as in step S1. Based on the wavelength position corresponding to the optimal feature spectrum determined in step S6, the corresponding wavelength variable data is extracted from the preprocessed spectrum to be tested. The extracted data is input into the quantitative prediction model established in step S7, and the nutrient concentration detection results of the fertilizer sample to be tested are calculated and output in real time.
[0032] In some embodiments of the present invention, establishing a quantitative prediction model based on the optimal characteristic spectrum and the corresponding nutrient concentration includes: The partial least squares regression algorithm was adopted, and the optimal number of principal components was determined by 10-fold cross-validation. The number of principal components that minimized the root mean square error of cross-validation was selected for model building.
[0033] After 1000 iterations of screening, the RMSECV values corresponding to all 1000 candidate wavelength variable subsets are compared, and the subset that minimizes the RMSECV is selected as the optimal feature spectrum. Based on the optimal feature spectrum subset and the nutrient concentrations corresponding to all samples in the calibration set, a quantitative prediction model is constructed. This model uses the Partial Least Squares (PLS) regression algorithm. To determine the key hyperparameter of the model, namely the optimal number of principal components or latent variables, 10-fold cross-validation is used again. Specifically, a search range for the number of principal components is set, for example, from 1 to 20. For each candidate number of principal components, starting with 1, 10-fold cross-validation is used to calculate the RMSECV, and this is repeated for all values from 1 to 20. The curve of RMSECV against the number of principal components, shown in Figure 3, is used to find the number of principal components that gives the global minimum RMSECV. Using the entire calibration set, restricted to the optimal characteristic spectrum, together with the determined optimal number of principal components, a PLS model is trained; this is the quantitative model for online detection of nutrient concentration in fertilizer samples.
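The search over the number of principal components amounts to a one-dimensional grid search. A minimal sketch follows, assuming a hypothetical callable `rmsecv_for(n)` that returns the 10-fold RMSECV of a PLS model with n components; breaking ties in favour of the smaller component count is an added assumption, not something the text specifies.

```python
def best_component_count(rmsecv_for, max_components=20):
    """Return (best_n, scores) for n = 1..max_components, minimising RMSECV."""
    scores = {n: rmsecv_for(n) for n in range(1, max_components + 1)}
    # Sort key (score, n): the lowest RMSECV wins, ties go to fewer components.
    best = min(scores, key=lambda n: (scores[n], n))
    return best, scores
```

The returned `scores` dictionary holds the points of the RMSECV curve that Figure 3 is described as plotting.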
[0034] This invention also relates to an online rapid detection system for fertilizer nutrients based on near-infrared spectroscopy, comprising the following modules: The preprocessing module is used to acquire online near-infrared spectral data of fertilizer samples to be tested, establish a calibration set containing spectral data and corresponding nutrient concentrations, and preprocess the spectral data. The screening module is used to perform multiple iterations to screen wavelength variables. Each iteration includes: randomly sampling from the calibration set to construct a sample subset, and establishing a partial least squares regression model based on the subset; extracting the absolute values of the regression coefficients of each wavelength variable in the regression model as importance weights; calculating the number of wavelength variables to be retained in the current iteration using an exponential decay function, and retaining variables with high weights according to the importance weights to obtain a subset of candidate wavelength variables; calculating the root mean square error of the cross-validation of the subset of candidate wavelength variables using an optimized k-fold cross-validation method; the optimization method includes: calculating the leverage value and Mahalanobis distance of all samples in the calibration set, mapping the samples to a two-dimensional space composed of two parameters; dividing the space into grids, and using the grids as layers, all samples are allocated to k data folds through stratified random sampling to ensure the representativeness of the distribution of data in each fold; The selection module is used to select the subset of candidate wavelength variables with the smallest error as the optimal feature spectrum based on the root mean square error of cross-validation calculated in all iterations. 
The establishment module is used to build a quantitative prediction model based on the optimal characteristic spectrum and the corresponding nutrient concentration.
[0035] In some embodiments of the present invention, the preprocessing of the spectral data includes: The effects of sample surface scattering and optical path variation are eliminated by using the standard normal variate (SNV) transformation; The transformed spectrum is processed using the Savitzky-Golay smoothing derivative method.
[0036] In some embodiments of the present invention, constructing a sample subset by randomly sampling from the calibration set includes: Set the total number of iterations; In each iteration, 80% of the samples are randomly drawn from the calibration set without replacement as a subset of samples used to build the partial least squares regression model.
[0037] In some embodiments of the present invention, the step of calculating the number of wavelength variables to be retained in the current iteration using an exponential decay function includes: The number of wavelength variables to be retained in the i-th iteration is obtained by rounding the result of the formula a×exp(-k×i), where a is the total number of initial wavelength variables, k is the attenuation rate coefficient, and i is the current iteration number; and that number of variables with the highest importance weights is retained.
[0038] In some embodiments of the present invention, calculating the leverage values and Mahalanobis distances of all samples in the calibration set includes: for the i-th sample $x_i$ (a row vector) in the calibration set spectral matrix X, the sample leverage value $h_i$ is calculated by the formula $h_i = x_i (X^{T} X)^{-1} x_i^{T}$; and the sample Mahalanobis distance $d_i$ is calculated by the formula $d_i = \sqrt{(x_i - \mu)\, S^{-1}\, (x_i - \mu)^{T}}$, where μ is the mean vector of all sample spectra, S is the covariance matrix, and X is the calibration set spectral matrix.
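The sketch below computes both quantities with NumPy, using the standard definitions: leverage as the diagonal of the hat matrix X(XᵀX)⁻¹Xᵀ, and Mahalanobis distance from the sample mean and sample covariance. Two points are assumptions rather than statements from the source: a pseudo-inverse is used so that a rank-deficient matrix does not raise an error, and X is taken to have more samples than columns (for example after reduction to a few scores), because raw spectra with more wavelengths than samples make every leverage equal to one.

```python
import numpy as np


def leverage(X):
    """Diagonal of the hat matrix X (X^T X)^+ X^T, one value per sample (row)."""
    gram_inv = np.linalg.pinv(X.T @ X)
    return np.einsum("ij,jk,ik->i", X, gram_inv, X)


def mahalanobis(X):
    """Mahalanobis distance of each row from the mean, using the sample covariance."""
    mu = X.mean(axis=0)
    S = np.atleast_2d(np.cov(X, rowvar=False))
    centred = X - mu
    d2 = np.einsum("ij,jk,ik->i", centred, np.linalg.pinv(S), centred)
    return np.sqrt(np.clip(d2, 0.0, None))  # clip guards against tiny negative round-off


rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 3))   # synthetic stand-in for a reduced spectral matrix
h = leverage(X_demo)
d = mahalanobis(X_demo)
```

Each sample then becomes one point (h, d) in the two-dimensional space used for stratification.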
[0039] In some embodiments of the present invention, dividing the space into a grid includes: with the leverage value on the horizontal axis and the Mahalanobis distance on the vertical axis, determining the maximum and minimum values of all sample points on both axes; and dividing each of the horizontal and vertical axes into 10 equal intervals, thereby dividing the entire two-dimensional space into 100 equal-sized, non-overlapping rectangular grid cells.
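One possible way to assign each sample to a grid cell is sketched here. The function name is hypothetical, and the handling of a point lying exactly on the upper boundary (clamped into the last interval) is an assumption, since the source does not say how boundary points are treated.

```python
def grid_cells(leverages, distances, bins=10):
    """Map each (leverage, distance) point to a (column, row) cell of a bins x bins grid."""

    def axis_index(values):
        lo, hi = min(values), max(values)
        width = (hi - lo) / bins or 1.0  # guard against an axis with a single value
        # The maximum lies exactly on the upper edge, so clamp it into the last interval.
        return [min(int((v - lo) / width), bins - 1) for v in values]

    return list(zip(axis_index(leverages), axis_index(distances)))


cells = grid_cells([0.0, 10.0, 4.0], [0.0, 20.0, 10.0])
```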
[0040] In some embodiments of the present invention, allocating all samples to k data folds through stratified random sampling includes: treating every non-empty grid cell containing samples as a stratum; within each stratum, randomly ordering the unassigned samples and assigning them sequentially to the first fold, the second fold, and so on up to the k-th fold, then cycling again from the first fold until all samples in the stratum have been assigned; and repeating this process for all strata to obtain k cross-validation folds, where k is the number of cross-validation folds.
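The round-robin allocation can be sketched as follows. It takes one grid-cell label per sample as input and restarts at the first fold for every stratum, which is one literal reading of the text; the function name and the fixed seed are illustrative assumptions.

```python
import random
from collections import defaultdict


def stratified_folds(cells, k, seed=0):
    """Assign sample indices to k folds, dealing each stratum out round-robin."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for index, cell in enumerate(cells):
        strata[cell].append(index)            # only non-empty cells become strata
    folds = [[] for _ in range(k)]
    for cell in sorted(strata):
        members = strata[cell]
        rng.shuffle(members)                  # random order within the stratum
        for position, index in enumerate(members):
            folds[position % k].append(index) # restart at the first fold per stratum
    return folds


demo_cells = [(0, 0)] * 10 + [(1, 1)] * 5
demo_folds = stratified_folds(demo_cells, k=5)
```

Because every stratum starts at the first fold, strata with fewer than k samples add only to the early folds. Carrying the fold position over from one stratum to the next would even out fold sizes, but that variant is not described in the source.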
[0041] In some embodiments of the present invention, establishing a quantitative prediction model based on the optimal characteristic spectrum and the corresponding nutrient concentration includes: adopting the partial least squares regression algorithm; determining the optimal number of principal components by 10-fold cross-validation; and selecting, for model building, the number of principal components that minimizes the root mean square error of cross-validation.
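A minimal sketch of choosing the number of components by 10-fold cross-validation follows. The source says only that partial least squares regression is used; the single-response NIPALS variant (PLS1), the plain random folds, and all function names here are assumptions made for illustration, and the data are synthetic.

```python
import numpy as np


def pls1_fit(X, y, n_comp):
    """Fit PLS1 by NIPALS; returns (regression coefficients, x_mean, y_mean)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xr, yr = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_comp):
        w = Xr.T @ yr
        w /= np.linalg.norm(w)          # weight vector
        t = Xr @ w                      # scores
        tt = t @ t
        p = Xr.T @ t / tt               # X loadings
        c = yr @ t / tt                 # y loading
        Xr = Xr - np.outer(t, p)        # deflate X
        yr = yr - c * t                 # deflate y
        W.append(w); P.append(p); q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    return W @ np.linalg.solve(P.T @ W, q), x_mean, y_mean


def rmsecv(X, y, n_comp, folds):
    """Root mean square error of cross-validation over the given index folds."""
    errors = []
    for test in folds:
        train = np.setdiff1d(np.arange(len(y)), test)
        b, xm, ym = pls1_fit(X[train], y[train], n_comp)
        errors.extend((X[test] - xm) @ b + ym - y[test])
    return float(np.sqrt(np.mean(np.square(errors))))


def best_n_components(X, y, max_comp, k=10, seed=0):
    """Return the component count with the lowest RMSECV, plus all scores."""
    order = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(order, k)
    scores = {n: rmsecv(X, y, n, folds) for n in range(1, max_comp + 1)}
    return min(scores, key=scores.get), scores


rng = np.random.default_rng(1)
X_demo = rng.normal(size=(60, 5))                    # synthetic selected wavelengths
true_b = np.array([1.0, 0.0, -2.0, 0.0, 0.5])
y_demo = X_demo @ true_b + 0.01 * rng.normal(size=60)
best, scores = best_n_components(X_demo, y_demo, max_comp=5)
```

The same `rmsecv` function accepts any list of index folds, so folds produced by a stratified scheme could be passed in place of the random ones.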
[0042] It should be clarified that this application is not limited to the specific configurations and processes described above and shown in the figures. For the sake of brevity, detailed descriptions of known methods are omitted here. In the above embodiments, several specific steps are described and shown as examples. However, the method process of this application is not limited to the specific steps described and shown. Those skilled in the art can make various changes, modifications, and additions, or change the order of steps, after understanding the spirit of this application.
[0043] The functional modules shown in the above-described block diagram can be implemented as hardware, software, firmware, or a combination thereof. When implemented in hardware, they can be, for example, electronic circuits, application-specific integrated circuits (ASICs), appropriate firmware, plug-ins, function cards, etc. When implemented in software, the elements of this application are programs or code segments used to perform the required tasks. Programs or code segments can be stored on a machine-readable medium or transmitted over a transmission medium or communication link via data signals carried on a carrier wave. "Machine-readable medium" can include any medium capable of storing or transmitting information. Examples of machine-readable media include electronic circuits, semiconductor memory devices, ROM, flash memory, erasable ROM (EROM), floppy disks, CD-ROMs, optical disks, hard disks, fiber optic media, radio frequency (RF) links, etc. Code segments can be downloaded via computer networks such as the Internet, intranets, etc.
[0044] It should also be noted that the exemplary embodiments mentioned in this application describe methods or systems based on a series of steps or apparatus. However, this application is not limited to the order of the above steps; that is, the steps can be performed in the order mentioned in the embodiments, or in a different order, or several steps can be performed simultaneously.
[0045] The aspects of this application have been described above with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of this application. It should be understood that each block in the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus to produce a machine such that these instructions, executed via the processor of the computer or other programmable data processing apparatus, enable the implementation of the functions/actions specified in one or more blocks of the flowchart illustrations and/or block diagrams. Such a processor can be, but is not limited to, a general-purpose processor, a special-purpose processor, an application-specific processor, or a field-programmable logic circuit. It is also understood that each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can also be implemented by dedicated hardware performing the specified functions or actions, or can be implemented by a combination of dedicated hardware and computer instructions.
[0046] The above description is merely a specific implementation of this application. Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the specific working processes of the systems, modules, and units described above correspond to the processes in the foregoing method embodiments and are not repeated here. It should be understood that the protection scope of this application is not limited thereto. Any person skilled in the art can easily conceive of various equivalent modifications or substitutions within the technical scope disclosed in this application, and these modifications or substitutions should all be covered within the protection scope of this application.
Claims
1. A method for online rapid detection of chemical fertilizer nutrients based on near-infrared spectroscopy, characterized in that it comprises the following steps: acquiring online near-infrared spectral data of fertilizer samples to be tested, establishing a calibration set containing spectral data and corresponding nutrient concentrations, and preprocessing the spectral data; screening wavelength variables over multiple iterations, each iteration including: randomly sampling from the calibration set to construct a sample subset, and establishing a partial least squares regression model based on the subset; extracting the absolute values of the regression coefficients of each wavelength variable in the regression model as importance weights; calculating the number of wavelength variables to be retained in the current iteration using an exponential decay function, and retaining the variables with the highest importance weights to obtain a subset of candidate wavelength variables; and calculating the root mean square error of cross-validation of the subset of candidate wavelength variables using an optimized k-fold cross-validation method, wherein the optimization includes: calculating the leverage value and Mahalanobis distance of all samples in the calibration set, and mapping the samples to a two-dimensional space formed by these two parameters; dividing the space into a grid; and, using the grid cells as strata, allocating all samples to k data folds through stratified random sampling so that the sample distribution in each fold is representative; selecting, based on the root mean square error of cross-validation calculated in all iterations, the subset of candidate wavelength variables with the smallest error as the optimal characteristic spectrum; and establishing a quantitative prediction model based on the optimal characteristic spectrum and the corresponding nutrient concentration.
2. The method of claim 1, wherein the preprocessing of the spectral data includes: eliminating the effects of sample surface scattering and optical path variation by standard normal variate (SNV) transformation; and processing the transformed spectrum with the Savitzky-Golay smoothing derivative method.
3. The method of claim 1, wherein constructing a sample subset by randomly sampling from the calibration set includes: setting the total number of iterations; and, in each iteration, randomly drawing 80% of the samples from the calibration set without replacement as the sample subset used to build the partial least squares regression model.
4. The method according to claim 1, characterized in that calculating the number of wavelength variables to be retained in the current iteration using the exponential decay function includes: the number of wavelength variables to be retained in the i-th iteration is obtained by rounding the result of the formula a×exp(-k×i), where a is the total number of initial wavelength variables, k is the attenuation rate coefficient, and i is the current iteration number; and that number of variables with the highest importance weights is retained.
5. The method according to claim 1, characterized in that calculating the leverage values and Mahalanobis distances of all samples in the calibration set includes: for the i-th sample $x_i$ (a row vector) in the calibration set spectral matrix X, the sample leverage value $h_i$ is calculated by the formula $h_i = x_i (X^{T} X)^{-1} x_i^{T}$; and the sample Mahalanobis distance $d_i$ is calculated by the formula $d_i = \sqrt{(x_i - \mu)\, S^{-1}\, (x_i - \mu)^{T}}$, where μ is the mean vector of all sample spectra, S is the covariance matrix, and X is the calibration set spectral matrix.
6. The method according to claim 1, characterized in that dividing the space into a grid includes: with the leverage value on the horizontal axis and the Mahalanobis distance on the vertical axis, determining the maximum and minimum values of all sample points on both axes; and dividing each of the horizontal and vertical axes into 10 equal intervals, thereby dividing the entire two-dimensional space into 100 equal-sized, non-overlapping rectangular grid cells.
7. The method according to claim 1, characterized in that allocating all samples to k data folds through stratified random sampling includes: treating every non-empty grid cell containing samples as a stratum; within each stratum, randomly ordering the unassigned samples and assigning them sequentially to the first fold, the second fold, and so on up to the k-th fold, then cycling again from the first fold until all samples in the stratum have been assigned; and repeating this process for all strata to obtain k cross-validation folds, where k is the number of cross-validation folds.
8. The method according to claim 1, characterized in that establishing the quantitative prediction model based on the optimal characteristic spectrum and the corresponding nutrient concentration includes: adopting the partial least squares regression algorithm; determining the optimal number of principal components by 10-fold cross-validation; and selecting, for model building, the number of principal components that minimizes the root mean square error of cross-validation.
9. A rapid online detection system for fertilizer nutrients based on near-infrared spectroscopy, characterized in that it comprises the following modules: a preprocessing module, used to acquire online near-infrared spectral data of fertilizer samples to be tested, establish a calibration set containing spectral data and corresponding nutrient concentrations, and preprocess the spectral data; a screening module, used to perform multiple iterations to screen wavelength variables, each iteration including: randomly sampling from the calibration set to construct a sample subset, and establishing a partial least squares regression model based on the subset; extracting the absolute values of the regression coefficients of each wavelength variable in the regression model as importance weights; calculating the number of wavelength variables to be retained in the current iteration using an exponential decay function, and retaining the variables with the highest importance weights to obtain a subset of candidate wavelength variables; and calculating the root mean square error of cross-validation of the subset of candidate wavelength variables using an optimized k-fold cross-validation method, wherein the optimization includes: calculating the leverage value and Mahalanobis distance of all samples in the calibration set, and mapping the samples to a two-dimensional space formed by these two parameters; dividing the space into a grid; and, using the grid cells as strata, allocating all samples to k data folds through stratified random sampling so that the sample distribution in each fold is representative; a selection module, used to select the subset of candidate wavelength variables with the smallest error as the optimal characteristic spectrum, based on the root mean square error of cross-validation calculated in all iterations; and an establishment module, used to build a quantitative prediction model based on the optimal characteristic spectrum and the corresponding nutrient concentration.
10. The system according to claim 9, characterized in that the preprocessing of the spectral data includes: eliminating the effects of sample surface scattering and optical path variation by standard normal variate (SNV) transformation; and processing the transformed spectrum with the Savitzky-Golay smoothing derivative method.