A quality control index content detection method for traditional Chinese medicine oral liquid
By employing an ensemble learning algorithm that combines intelligent dual-track feature band screening with dynamic out-of-bag error gradient optimization, the method addresses the cumbersome procedures and insufficient model generalization in quality control index detection for traditional Chinese medicine oral liquids. It enables non-destructive, rapid, and green simultaneous detection of multiple indicators, making it suitable for efficient quality control of traditional Chinese medicine oral liquids.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- SHANDONG ACAD OF CHINESE MEDICINE
- Filing Date
- 2026-05-22
- Publication Date
- 2026-06-19
AI Technical Summary
Existing technologies for quality control index testing of oral liquid traditional Chinese medicine are cumbersome and time-consuming. Furthermore, traditional chromatographic analysis methods consume a large amount of toxic reagents, making it difficult to achieve real-time quality control of large batches of samples. In complex traditional Chinese medicine systems, NIRS technology faces problems such as severe overlap of spectral information, low signal-to-noise ratio, bottlenecks in characteristic wavelength screening algorithms, and insufficient model generalization ability.
An ensemble learning algorithm employing dual-track feature band intelligent screening and dynamic out-of-bag error gradient optimization, combined with an asymmetric cascade optimization mechanism and support vector regression (SVR) deep mapping, is used to construct a quality control index content detection model, enabling rapid detection of multiple active ingredients in traditional Chinese medicine oral liquids via near-infrared spectroscopy.
It achieves non-destructive, rapid, and green simultaneous detection of multiple indicators, significantly improving detection efficiency and accuracy. It can accurately quantify multiple active ingredients in complex Chinese herbal matrices, replacing the cumbersome pretreatment and toxic reagents used in traditional chromatographic analysis.
Figure CN122245490A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the fields of analytical chemistry and artificial intelligence, specifically to a method for detecting the content of quality control indicators in traditional Chinese medicine oral liquids.
Background Technology
[0002] Codonopsis pilosula, a traditional and precious Chinese medicinal herb, tonifies the spleen and lungs and replenishes qi. Its deep-processed product, "Codonopsis pilosula Oral Liquid," is widely used in clinical and health care settings for spleen and lung deficiency, shortness of breath, palpitations, poor appetite, loose stools, and wheezing cough. It is primarily used to treat infantile diarrhea due to spleen deficiency, anemia in obstetrics and gynecology, chronic gastritis, chronic nephritis, and various symptoms of spleen and lung qi deficiency after radiotherapy and chemotherapy. The quality control system of Codonopsis pilosula Oral Liquid relies heavily on determining the content of the active markers it contains. Specifically, chlorogenic acid has antibacterial and antiviral effects; syringin and genipin show outstanding anti-inflammatory and hepatoprotective effects; codonopsis glycosides and codonopsis alcohol are polyyne components unique to Codonopsis pilosula plants, with regulatory effects on the nervous and immune systems; and atractylodes lactone III has digestive and antitumor activities.
[0003] Currently, when determining the content of the above-mentioned small molecule active ingredients (chlorogenic acid, syringin, genipin, codonopsis glycoside, codonopsis alcohol, and atractylodes lactone III), the national pharmacopoeia and industry standards usually stipulate the use of ultra-high performance liquid chromatography (UHPLC) or high performance liquid chromatography (HPLC). However, this type of chromatographic analysis method has significant limitations: (1) The pretreatment is cumbersome and time-consuming. The sample needs to undergo a series of complex physicochemical steps, such as precise measurement, addition of organic solvent (such as methanol) for ultrasonic extraction, high-speed centrifugation, and microporous membrane filtration. (2) The reagent consumption is large and the environment is polluted: it involves a large amount of toxic and harmful reagents such as acetonitrile, methanol, and glacial acetic acid, which does not conform to the development trend of green chemistry. (3) The detection efficiency bottleneck: the single sample chromatography operation time is long, and it depends on the selection of instruments and chromatographic columns, making it difficult to achieve real-time quality control of large batches of samples.
[0004] Near-infrared spectroscopy (NIRS) primarily detects the overtone and combination absorptions of vibrations of hydrogen-containing groups (such as C-H, O-H, and N-H), and offers significant advantages: it is non-destructive, rapid, and requires no reagents. However, for highly complex mixture systems such as traditional Chinese medicine oral liquids, NIRS faces severe challenges. First, spectral information overlaps heavily and the signal-to-noise ratio is low. The strong absorption peaks of water can mask the weak signals of trace active ingredients, and instrument noise and baseline drift caused by light scattering are significant. Second, there are algorithmic bottlenecks in feature wavelength selection. Existing Variable Importance in Projection (VIP) algorithms can only perform simple threshold truncation based on linear partial least squares regression (PLSR) models, while the Competitive Adaptive Reweighted Sampling (CARS) algorithm, when faced with high-noise data, is prone to accidentally deleting weak but crucial feature bands because of the initial randomness of Monte Carlo sampling. Finally, the generalization ability of a single model is limited. Existing chemometric methods mostly employ a single PLSR model, which cannot fit the complex nonlinear absorbance-concentration relationship present in oral liquids. Even nonlinear machine learning algorithms such as random forest (RF) or support vector regression (SVR) show significant variability in predicting different components. Conventional model fusion (such as weighted averaging) often mechanically combines all models, and the "negative transfer" effect of poorly performing base models frequently lowers overall accuracy.
Summary of the Invention
[0005] To address the technical problems of cumbersome detection procedures, difficulty in extracting near-infrared spectral features, and weak generalization ability of single models in existing technologies, this invention provides a method for detecting the content of quality control indicators in traditional Chinese medicine oral liquids. The method uses an ensemble learning algorithm that combines intelligent dual-track feature band screening with dynamic out-of-bag error gradient optimization to overcome background interference, lock onto effective spectral features, and achieve strong generalization ability.
[0006] To solve the aforementioned technical problems, the present invention adopts the following technical solution: a method for detecting the content of quality control indicators in traditional Chinese medicine oral liquids, comprising the following steps:
S01. Collect samples of the traditional Chinese medicine oral liquid to be tested, prepare reference samples and determine the true value Y of their quality control index content, collect the near-infrared spectra of all samples to be tested, and form a spectral matrix X.
S02. Data preprocessing: divide the spectral data and true content values of all samples to be tested into a training set and a test set, and construct a quality control index content detection model based on the training and test sets, including:
S21. Perform variance filtering and preprocessing on the spectral data in the training set. Then use an asymmetric cascade optimization mechanism based on global response anchoring and dynamic elimination to extract the core feature subset from the spectral data.
S22. Based on the extracted core feature subset, establish a base model pool and obtain the out-of-fold (OOF) prediction matrix of each base model through 5-fold cross-validation.
S23. To address inconsistent prediction performance across the base model pool, this invention abandons traditional full linear weighting and subjective fixed-threshold screening. Instead, it uses a dynamic Top-K adaptive fusion architecture based on adaptive truncation at the performance inflection point and nonlinear support vector regression (SVR) mapping. The process is as follows: extract the cross-validation root mean square error (RMSE) of every base model and sort the values in ascending order (the smaller the error, the higher the ranking). Generate the first-order difference sequence (error change gradient) of adjacent model errors. Find the largest positive jump in this difference sequence and take it as the performance mutation inflection point. The index of this inflection point is automatically used as the truncation boundary (with a supplementary mechanism that forces retention of the two best base models to avoid fusion failure). Only the OOF predictions and the test set predictions of the high-quality base models before the inflection point are concatenated column-wise, constructing the new meta-feature training matrix and meta-feature test matrix respectively and forming the Top-K high-quality subset. This mechanism cuts off the negative-transfer contamination caused by inferior base models.
S24. For the high-quality base models retained after truncation, the prediction residuals often conceal high-order nonlinear relationships that traditional linear weighting cannot capture. This step uses support vector regression (SVR), a single model with strong generalization ability, as the final fusion model. The OOF prediction matrices of the Top-K high-quality subset are concatenated column-wise to form a Meta-Train matrix. An SVR meta-learner based on a hybrid kernel function is trained on the Meta-Train matrix. The hybrid kernel function is formed by linearly superimposing a Linear kernel and an RBF radial basis kernel with adaptive weights at performance inflection points.
S25. Construct a quality control index content detection model consisting of asymmetric cascade optimization, dynamic Top-K truncation, and the SVR meta-learner, and adjust the model based on the test set.
S03. Detect the content of quality control indicators of the traditional Chinese medicine oral liquid using the quality control index content detection model.
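The inflection-point truncation of step S23 can be sketched in a few lines. This is a minimal illustration, not the patented implementation: the function name `top_k_by_inflection`, the `min_keep` parameter, and the RMSECV values are hypothetical, chosen only to show the ascending sort, the first-order difference, and the forced retention of the two best models.

```python
import numpy as np

def top_k_by_inflection(rmse_by_model, min_keep=2):
    """Return the names of the base models ranked before the largest RMSE jump."""
    names = sorted(rmse_by_model, key=rmse_by_model.get)   # ascending error
    if len(names) <= min_keep:
        return names
    errors = np.array([rmse_by_model[name] for name in names])
    gaps = np.diff(errors)            # first-order difference of adjacent errors
    cut = int(np.argmax(gaps))        # position of the largest positive jump
    k = max(cut + 1, min_keep)        # keep models up to the jump, at least min_keep
    return names[:k]

# Hypothetical cross-validation errors, for illustration only
rmsecv = {"PLS": 1.10, "SVR": 0.92, "KNN": 1.65, "GPR": 0.90,
          "RF": 1.05, "XGBoost": 0.98, "GBDT": 1.02}
kept = top_k_by_inflection(rmsecv)
print(kept)
```

With these made-up numbers the largest jump lies between PLS (1.10) and KNN (1.65), so only KNN is dropped. When the jump occurs right after the best model, the `min_keep` floor still retains two models.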
[0007] Step S01: Collect samples of Codonopsis pilosula oral liquid from different batches to ensure the diversity and representativeness of the sample space. Spectral data acquisition is performed using a Fourier transform near-infrared spectrometer (e.g., an Antaris II spectrometer). After the instrument is powered on and warmed up, the oral liquid samples are placed in quartz cuvettes with a fixed optical path (e.g., 1 mm). The spectrometer resolution is set to 8 cm⁻¹, air is used as the background reference, and the scanning spectral range is set to 10000-4000 cm⁻¹. To reduce instrument thermal noise and environmental interference, each sample is scanned 32 times consecutively under the same conditions, and the instrument automatically performs spectral averaging. Each sample is measured in parallel three times, and the average spectrum of the three parallel measurements is taken as the final near-infrared spectral response vector for that sample. The spectral vectors of all N samples constitute the original near-infrared spectral feature matrix X of dimension N × P (where P is the number of wavelength points).
[0008] The method for determining the true value Y of the quality control index content is as follows: The oral liquid sample is accurately measured, extracted with methanol by ultrasonication, centrifuged, and filtered through a microporous membrane to prepare the test solution. Ultra-high performance liquid chromatography (UHPLC) is used with a binary gradient elution using 0.2 wt% glacial acetic acid aqueous solution-acetonitrile as the mobile phase. The peak areas of chlorogenic acid, syringin, genipin, codonopsis glycoside, codonopsis alcohol, and atractylodes lactone III are recorded, and their absolute contents are calculated by substituting them into their respective standard curves.
[0009] To fundamentally eliminate the risk of data leakage during the spectral preprocessing stage, this method associates the near-infrared spectral matrix X with the concentration vector y of a single target component before any spectral transformation or feature selection. To avoid the test set distribution bias caused by traditional random partitioning, the SPXY algorithm is used to strictly divide the dataset into training and test sets (in a ratio of 8:2 or 7:3). The implementation steps are as follows: first, the spectral data and true concentration values are normalized; then, the Euclidean distance between any two samples p and q in the spectral space, $d_x(p,q) = \sqrt{\sum_{j=1}^{P} \left[x_p(j) - x_q(j)\right]^2}$, and the Euclidean distance in the concentration space, $d_y(p,q) = |y_p - y_q|$, are calculated. The normalized composite distance is then calculated: $d_{xy}(p,q) = \frac{d_x(p,q)}{\max_{p,q} d_x(p,q)} + \frac{d_y(p,q)}{\max_{p,q} d_y(p,q)}$, where $x_p(j)$ and $x_q(j)$ represent the spectral response values of sample p and sample q at the j-th wavelength point, and P represents the total number of near-infrared spectral characteristic variables.
[0010] First, the two samples with the largest composite distance are selected and placed in the training set. Then, among the remaining samples, the sample whose minimum distance to the already selected training samples is the largest is iteratively added to the training set until a predetermined number is reached. The remaining samples form the test set. The test set is strictly isolated and does not participate in subsequent variance calculation or preprocessing parameter fitting.
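The SPXY partition described above can be sketched as follows. This is a minimal sketch of the commonly used formulation (joint x-y distance with max-min selection), assuming a single target component. The function name `spxy_split` and the random demonstration data are hypothetical.

```python
import numpy as np

def spxy_split(X, y, train_fraction=0.8):
    """Split sample indices into training and test sets with SPXY-style selection."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).reshape(-1, 1)
    n = X.shape[0]
    n_train = int(round(train_fraction * n))

    # Pairwise Euclidean distances in the spectral space
    sq = (X ** 2).sum(axis=1)
    dx = np.sqrt(np.clip(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0, None))
    # Pairwise distances in the concentration space
    dy = np.abs(y - y.T)
    # Composite distance, each part normalized by its maximum
    d = dx / dx.max() + dy / dy.max()

    # Seed the training set with the two most distant samples
    i, j = np.unravel_index(np.argmax(d), d.shape)
    selected = [int(i), int(j)]
    remaining = [k for k in range(n) if k not in selected]

    # Repeatedly add the sample farthest from its nearest selected neighbour
    while len(selected) < n_train:
        nearest = d[np.ix_(remaining, selected)].min(axis=1)
        pick = remaining[int(np.argmax(nearest))]
        selected.append(pick)
        remaining.remove(pick)
    return np.array(selected), np.array(remaining)

# Random stand-in data, for illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 50))
y_demo = rng.normal(size=20)
train_idx, test_idx = spxy_split(X_demo, y_demo, train_fraction=0.8)
print(len(train_idx), len(test_idx))
```

Returning indices rather than arrays keeps the test rows untouched, so later variance filtering and preprocessing parameters can be fitted on `X[train_idx]` only.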
[0011] Furthermore, the variance filtering in step S21 performs a global variance analysis on the original spectral matrix of the training set, calculates the variance of each band (i.e., each feature column vector), and removes bands whose variance is below a threshold (e.g., ...). Bands with extremely low variance usually carry no signal and contain no useful chemical absorption information. Removing them before complex spectral transformations not only reduces the computational burden of high-dimensional data, but also prevents tiny instrument noise from being drastically amplified when a denominator approaches zero in subsequent calculations.
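A minimal sketch of the variance filter follows. The threshold value `1e-8` is purely hypothetical (the text does not give one here), and the small matrices are illustrative. The point shown is that the band mask is computed from the training set only and then applied unchanged to the test set.

```python
import numpy as np

def fit_variance_mask(X_train, threshold):
    """Boolean mask of bands whose training-set variance reaches the threshold."""
    return X_train.var(axis=0) >= threshold

# Illustrative data: the first band is constant and carries no information
X_train = np.array([[0.10, 1.0, 0.5],
                    [0.10, 2.0, 0.7],
                    [0.10, 3.0, 0.9]])
X_test = np.array([[0.10, 2.5, 0.6]])

mask = fit_variance_mask(X_train, threshold=1e-8)   # hypothetical threshold
X_train_kept = X_train[:, mask]
X_test_kept = X_test[:, mask]                       # same bands, no refitting
print(mask)
```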
[0012] The preprocessing described in step S21 includes no preprocessing, standard normal variate transformation, multiplicative scatter correction, and Savitzky-Golay smoothing and differentiation. Multiple preprocessing methods are run in parallel. Preprocessing parameters (such as the mean, standard deviation, and ideal average spectrum) are calculated only from the training set and then applied to the test set, generating multiple sets of preprocessed feature matrices that address different types of physical interference. (1) No processing (RAW): the original spectral information after filtering is retained as a baseline control group; (2) Standard normal variate transformation (SNV): each single spectral vector x is transformed to eliminate the surface scattering error caused by suspended solid particles; (3) Multiplicative scatter correction (MSC): the ideal average spectrum of all samples is established, and the spectrum of each sample is regressed on it by univariate linear regression to correct the baseline translation and shift caused by optical path differences; (4) Savitzky-Golay smoothing and differentiation (SG / SG-1D): with a specific window length (e.g., 25) and polynomial order (e.g., 3), a polynomial is fitted to the data points within the window by least squares, and the zero-order smoothed value or the first-order derivative is extracted to eliminate high-frequency instrument noise and baseline drift. In this invention, the preprocessing method is preferably SG-1D with a window length of 25 and a polynomial order of 3.
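The three transformations can be sketched with NumPy and SciPy as below. This is an illustrative sketch, not the authors' code: the function names and synthetic spectra are hypothetical, and the SG-1D step uses `scipy.signal.savgol_filter` with the window length 25 and polynomial order 3 stated above. Note that only MSC needs a fitted parameter (the reference spectrum), which is taken from the training set.

```python
import numpy as np
from scipy.signal import savgol_filter

def snv(X):
    """Standard normal variate: centre and scale each spectrum (row) on its own."""
    return (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

def msc_fit(X_train):
    """Reference ('ideal average') spectrum, computed from the training set only."""
    return X_train.mean(axis=0)

def msc_apply(X, reference):
    """Regress each spectrum on the reference and remove offset and slope."""
    corrected = np.empty_like(X, dtype=float)
    for i, row in enumerate(X):
        slope, intercept = np.polyfit(reference, row, 1)
        corrected[i] = (row - intercept) / slope
    return corrected

def sg_first_derivative(X, window_length=25, polyorder=3):
    """Savitzky-Golay first derivative along the wavelength axis."""
    return savgol_filter(X, window_length=window_length,
                         polyorder=polyorder, deriv=1, axis=1)

# Synthetic spectra: one shared band shape with per-sample scale and offset
rng = np.random.default_rng(0)
base = np.sin(np.linspace(0, 6, 200)) + 2.0
X_train = rng.uniform(0.8, 1.2, size=(10, 1)) * base + rng.uniform(-0.1, 0.1, size=(10, 1))
X_test = 1.1 * base[None, :] + 0.05

reference = msc_fit(X_train)
X_train_msc = msc_apply(X_train, reference)
X_test_msc = msc_apply(X_test, reference)     # test set reuses the training reference
X_train_snv = snv(X_train)
X_test_sg1d = sg_first_derivative(X_test)
```

Because the synthetic spectra differ only by scale and offset, MSC maps every one of them back onto the reference spectrum, which makes the effect easy to check.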
[0013] The characteristic absorption peaks of trace components in complex oral liquid matrices are easily masked by strong water peaks, and because their absolute variance is small they are prone to accidental deletion during the random sampling of a single CARS run. To address this, the present solution constructs an asymmetric cascade optimization mechanism. Specifically, a basic PLS regression model is built on the full band matrix, and the variable importance in projection (VIP) score of each band variable with respect to the response is calculated. Bands with VIP scores exceeding a significance threshold (e.g., VIP ≥ 1.0) are selected as the target retention variable set. This set is input into the CARS dynamic redundancy elimination model. A number of Monte Carlo iterations is set, and the proportion of variables retained in the i-th iteration is controlled by an exponential decay function. A PLS model is built in each iteration, and bands with low weights are eliminated based on the absolute value of the regression coefficients. Finally, the band set with the smallest RMSECV in 5-fold cross-validation is selected. This asymmetric cascade mechanism both locks in and preserves weak effective signals in advance and deeply compresses the concentration-independent structural noise space. This dual constraint ensures that core absorption peaks are retained.
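The exponential decay function is not written out above. The sketch below assumes the form used in the original CARS literature, $r_i = a e^{-k i}$ with $a = (P/2)^{1/(N-1)}$ and $k = \ln(P/2)/(N-1)$, so that all P variables are kept in the first of N iterations and two remain in the last. The function name and the values P = 400 and N = 50 are hypothetical.

```python
import numpy as np

def cars_retention_ratios(n_variables, n_iterations):
    """Fraction of variables kept at iterations 1..N under exponential decay."""
    k = np.log(n_variables / 2) / (n_iterations - 1)
    a = (n_variables / 2) ** (1 / (n_iterations - 1))
    i = np.arange(1, n_iterations + 1)
    return a * np.exp(-k * i)

# Hypothetical sizes: 400 anchored bands, 50 Monte Carlo iterations
n_bands = 400
ratios = cars_retention_ratios(n_variables=n_bands, n_iterations=50)
n_kept = np.maximum(np.round(ratios * n_bands).astype(int), 2)
print(n_kept[:5], n_kept[-5:])
```

The schedule removes many bands in the early iterations and few in the later ones, which is why anchoring weak but important bands before CARS starts can matter.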
[0014] Furthermore, if the number of bands with VIP scores exceeding the significance threshold does not meet the set requirements, an auxiliary retention mechanism is triggered to forcibly retain the top-10 bands with the highest scores.
[0015] Furthermore, the base model pool includes partial least squares regression, support vector regression, k-nearest neighbor regression, Gaussian process regression, random forest, extreme gradient boosting tree, and gradient boosting decision tree. For each model in the pool, a unified 5-fold cross-validation combined with grid search is used to optimize the hyperparameters, with the goal of minimizing the cross-validation root mean square error. For each base model at its optimal hyperparameters, the out-of-fold (OOF) prediction vector, the test set prediction vector, and the corresponding root mean square error are recorded and saved; together these constitute the base model pool.
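A minimal sketch of collecting out-of-fold predictions for the base model pool follows, assuming scikit-learn. Only two of the seven model types are shown to keep it short. The hyperparameter grids, helper name, and synthetic data are hypothetical and much smaller than a real search would be.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold, cross_val_predict
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

def tune_and_collect_oof(estimator, param_grid, X, y, cv):
    """Grid-search one base model, then return it with its OOF vector and RMSECV."""
    search = GridSearchCV(estimator, param_grid, cv=cv,
                          scoring="neg_root_mean_squared_error")
    search.fit(X, y)
    best = search.best_estimator_
    # Each sample is predicted by a fold model that did not see it during fitting
    oof = cross_val_predict(best, X, y, cv=cv)
    rmsecv = float(np.sqrt(np.mean((y - oof) ** 2)))
    return best, oof, rmsecv

# Synthetic stand-in for the selected core feature subset
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
y = 3.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=40)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
pool = {
    "SVR": (make_pipeline(StandardScaler(), SVR(kernel="rbf")),
            {"svr__C": [1.0, 10.0], "svr__gamma": ["scale", 0.1]}),
    "KNN": (make_pipeline(StandardScaler(), KNeighborsRegressor()),
            {"kneighborsregressor__n_neighbors": [3, 5]}),
}
results = {name: tune_and_collect_oof(est, grid, X, y, cv)
           for name, (est, grid) in pool.items()}
oof_matrix = np.column_stack([results[name][1] for name in pool])
rmsecv = {name: results[name][2] for name in pool}
print(rmsecv)
```

Placing `StandardScaler` inside the pipeline means the scaling is refitted within each fold, so validation samples do not influence the scaling parameters.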
[0016] The above seven base models have different underlying data mapping logics, specifically including three types: (1) Linear and partial least squares class: Partial least squares regression (PLS) extracts orthogonal principal component latent variables through nonlinear iteration and automatically traverses the optimal number of principal components; (2) Distance and kernel function based mapping algorithms: Support Vector Regression (SVR), K-Nearest Neighbor Regression (KNN), and Gaussian Process Regression (GPR). Since these algorithms are extremely sensitive to feature scale, this invention forces the feature matrix to undergo a standard normalization transformation (StandardScaler) before model fitting. Among them, SVR uses the RBF kernel for high-dimensional mapping, and GPR combines the RBF and Matern kernel functions to tolerate white noise in the spectral data to the greatest extent.
[0017] (3) Ensemble and tree model classes: Random Forest (RF), Extreme Gradient Boosting Tree (XGBoost), and Gradient Boosting Decision Tree (GBDT). By introducing second-order Taylor expansion and bootstrap sampling of multiple decision trees, complex nonlinear residual relationships are captured in depth.
[0018] Furthermore, the Meta-Train matrix is subjected to forced standard normalization before being input into the SVR meta-learner to eliminate distance metric bias caused by slight inconsistencies in the output scales of different underlying base models. During SVR meta-learner training, 5-fold cross-validation combined with grid search is used to jointly optimize the penalty coefficient and kernel parameters. This design allows the system to adaptively choose whether to perform optimal linear cutting in the original space or to map it to an infinite-dimensional Hilbert space for nonlinear fusion, based on the data characteristics, thereby maximizing the complementary predictive potential of the base models.
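The hybrid-kernel meta-learner can be sketched with scikit-learn, whose `SVR` accepts a callable kernel. How the adaptive weight is derived is not specified above, so this sketch simply treats the mixing weight as one more hyperparameter in a small search. The grid values, function name, and synthetic Meta-Train matrix are hypothetical.

```python
from itertools import product

import numpy as np
from sklearn.metrics.pairwise import linear_kernel, rbf_kernel
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

def make_hybrid_kernel(weight, gamma):
    """Kernel K = weight * linear + (1 - weight) * RBF, with weight in [0, 1]."""
    def kernel(A, B):
        return (weight * linear_kernel(A, B)
                + (1.0 - weight) * rbf_kernel(A, B, gamma=gamma))
    return kernel

# Synthetic Meta-Train matrix: three base-model OOF columns around the true values
rng = np.random.default_rng(0)
y = rng.normal(10.0, 2.0, size=40)
meta_train = np.column_stack([y + rng.normal(0.0, s, size=40) for s in (0.3, 0.4, 0.5)])

cv = KFold(n_splits=5, shuffle=True, random_state=0)
best_score, best_model = -np.inf, None
for weight, gamma, C in product((0.0, 0.5, 1.0), (0.1, 1.0), (1.0, 10.0)):
    model = make_pipeline(StandardScaler(),
                          SVR(kernel=make_hybrid_kernel(weight, gamma), C=C))
    score = cross_val_score(model, meta_train, y, cv=cv,
                            scoring="neg_root_mean_squared_error").mean()
    if score > best_score:
        best_score, best_model = score, model

best_model.fit(meta_train, y)
fused = best_model.predict(meta_train)
```

A weight of 1 reduces the kernel to the purely linear case and a weight of 0 to the pure RBF case, which corresponds to the choice between linear and nonlinear fusion described above.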
[0019] Furthermore, the oral liquid is Codonopsis pilosula oral liquid, and the quality control indicators are one or more of chlorogenic acid, syringin, genipin, codonopsis glycoside, codonopsis alcohol, and atractylodes lactone III.
[0020] Furthermore, in step S01, the true value Y of the content of the quality control index is obtained using the current ultra-high performance liquid chromatography method.
[0021] The beneficial effects of this invention are: (1) It significantly optimizes the traditional quality control process, realizing high-throughput, non-destructive, and green simultaneous rapid detection of multiple indicators. This invention can serve as an effective alternative to the cumbersome pretreatment (extraction, centrifugation, and volume adjustment) and physicochemical analysis process that heavily relies on toxic organic solvents in traditional liquid chromatography detection. Only one extremely short near-infrared spectral scan is required to instantly and simultaneously output the absolute content of six core trace and macro indicators, such as chlorogenic acid, syringin, and genipin, in Codonopsis pilosula oral liquid, providing a non-destructive and rapid online monitoring solution for the large-scale continuous production of complex traditional Chinese medicine preparations.
[0022] (2) An innovative "VIP+CARS" asymmetric dual-track feature screening mechanism is proposed, which addresses the technical problem of weak signal extraction under a strong water background. Because the absorption peaks of trace active ingredients in the complex matrix of oral liquids are easily masked by strong water absorption bands, this invention employs a cascaded screening strategy. The global response anchoring mechanism of VIP prevents "dimensional collapse" and the erroneous removal of key weak-signal feature variables in the early stages of CARS Monte Carlo random iteration. The dual constraints yield a multiple-fold improvement in the residual predictive deviation (RPD) of the model for difficult indicators, overcoming the technical prejudice that near-infrared technology cannot accurately quantify trace components of traditional Chinese medicine.
[0023] (3) A dynamic adaptive fusion architecture based on the first-order difference mutation of the error (inflection point method) is proposed to solve the "negative transfer" dilemma of model fusion. Existing chemometric ensemble models often use mechanical full weighting, which is easily dragged down by poor base models. Building on the heterogeneous base model pool, this invention introduces first-order difference gradient optimization, automatically captures the inflection point of model performance mutation for dynamic truncation, and uses support vector regression (SVR) with dual-kernel optimization as a meta-learner for high-order nonlinear mapping. This architecture adaptively cuts off the contamination path of inferior models according to the data characteristics of each component and exploits the complementary potential of high-quality base models, giving the quantitative model strong generalization ability and industrial applicability.
Attached Figure Description
[0024] Figure 1 is a flowchart of the method described in the embodiment.
Detailed Implementation
[0025] The present invention will be further described below with reference to the accompanying drawings and specific embodiments.
[0026] Example 1: High-precision quantitative determination of genipin based on dual-track parallel feature extraction and dynamic integration.
[0027] Objective: To verify the ability of the method of the present invention to characterize and quantitatively predict trace effective components in complex water matrices.
[0028] As shown in Figure 1, the specific implementation steps are as follows: (1) The near-infrared spectral matrix of 143 bottles of Codonopsis pilosula oral liquid samples was obtained, and the reference true value of genipin was determined by ultra-high performance liquid chromatography. The dataset was divided into training and test sets in an 8:2 ratio using the SPXY algorithm to avoid data leakage.
[0029] (2) The original spectrum was preprocessed to resist physical interference using the Savitzky-Golay first derivative (SG-1D).
[0030] (3) Dual-track feature optimization verification: In view of the weak and easily masked signal of genipin, the system performs pure CARS and "VIP+CARS" cascade screening in parallel. The results show that after using VIP global anchoring to force the retention of key pharmacodynamic bands, and then inputting CARS for redundancy removal, the most core feature subset was successfully locked.
[0031] (4) Model Integration and Finalization: The core feature subsets mentioned above are input into the base model pool. The root mean square error of cross-validation (RMSECV) of each base model is calculated and the first-order difference is calculated. Automatic truncation is performed at the point of performance mutation. The high-quality OOF prediction matrix after truncation is standardized and then input into the support vector regression (SVR) meta-learner with RBF radial basis kernel for high-order mapping.
[0032] (5) Detection results: Validated on an independent test set, the "SG-1D+VIP+CARS+Dynamic Top-K Adaptive Fusion Model Based on Inflection Point Method" achieved excellent fitting results. Its test set coefficient of determination (R²P) was 0.9708, the root mean square error of prediction (RMSEP) was 0.8285, and the residual predictive deviation (RPD) reached 5.8560 (far exceeding the excellent standard of 3.0). The model performed well and is feasible as an alternative to chromatographic reference methods.
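The three reported metrics can be computed as in the sketch below. The RPD definition is an assumption, since the text does not give one: the standard deviation of the reference values divided by RMSEP, with the population form of the standard deviation. Under that assumption R²P = 1 - 1/RPD², which agrees with the figures above (1 - 1/5.8560² ≈ 0.9708). The sample values are hypothetical.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return (R2, RMSEP, RPD) for a set of reference and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residual = y_true - y_pred
    rmsep = np.sqrt(np.mean(residual ** 2))
    r2 = 1.0 - np.sum(residual ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    rpd = np.std(y_true) / rmsep      # population SD (ddof=0), an assumed convention
    return r2, rmsep, rpd

# Hypothetical reference and predicted contents, for illustration only
y_ref = [10.0, 12.0, 14.0, 16.0, 18.0]
y_hat = [10.5, 11.5, 14.2, 15.8, 18.4]
r2, rmsep, rpd = regression_metrics(y_ref, y_hat)
print(round(r2, 4), round(rmsep, 4), round(rpd, 4))
```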
[0033] Comparative Example 1: To objectively evaluate the technical advancement of the proposed construction scheme (SG-1D+VIP+CARS+Dynamic Top-K Adaptive Fusion Model based on Inflection Point Method), this embodiment selected genipin from Codonopsis pilosula oral liquid as a representative evaluation index and compared its performance with traditional chemometric modeling methods and other single algorithm combinations. The comparison results are shown in Table 1.
[0034] Table 1
The experimental data comparison and analysis in Table 1 show that: (1) Comparative Example 1 represents the traditional linear modeling paradigm (CARS+PLS). Because it cannot handle high-order nonlinear disturbances in the oral liquid system, all its indicators are at the lowest level (RPD is only 2.9126). Comparative Example 2 introduces a nonlinear single model (GPR), increasing the RPD to 3.4784, which demonstrates the necessity of nonlinear mapping for complex matrix data. Comparative Example 3 retains the traditional CARS screening but adopts the fusion model architecture of this invention, raising the RPD to 4.7596. This shows that the fusion architecture proposed in this invention, based on performance inflection point truncation, can effectively exploit the complementary potential of the base models, and that it outperforms the single regression algorithms compared here.
[0035] (2) Based on Comparative Example 3, the method of this invention further introduces the "VIP+CARS" asymmetric dual-track feature screening mechanism. Experimental results show that its RPD is further improved from 4.7596 to 5.8560, and RMSEP is reduced by about 18.7%. This data fully confirms the key role of "VIP global response anchoring" in processing genipin: it successfully locks the weak core feature bands that are easily lost in pure CARS Monte Carlo sampling, thus ensuring the fidelity of the model input from the source.
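The stated reduction of about 18.7% can be checked from the two RPD values alone. The check assumes both models were evaluated on the same test set, so that the standard deviation in RPD = SD / RMSEP cancels:

```latex
\frac{\mathrm{RMSEP}_{\text{invention}}}{\mathrm{RMSEP}_{\text{Comp.\,3}}}
  = \frac{\mathrm{RPD}_{\text{Comp.\,3}}}{\mathrm{RPD}_{\text{invention}}}
  = \frac{4.7596}{5.8560} \approx 0.8128
\quad\Longrightarrow\quad
1 - 0.8128 \approx 18.7\%
```

Under the same assumption, the RMSEP of Comparative Example 3 would be roughly 0.8285 / 0.8128 ≈ 1.019. This figure is implied by the ratio rather than read from Table 1.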
[0036] (3) The present invention achieves the "excellent" standard (RPD>3.0, or even close to 6.0) in the near-infrared spectroscopy analysis by deeply coordinating the feature screening track and the regressor architecture. This not only surpasses the traditional algorithm at the mathematical and statistical level, but also proves the feasibility of the model in replacing traditional methods such as ultra-high performance liquid chromatography for industrial-grade rapid quality control in practical applications.
[0037] Example 2: Final selection of chlorogenic acid model based on the dual mechanism of "internal filtration robustness + external experimental practice".
[0038] Objective: To verify the key role of the dual ultimate model selection mechanism described in step S8 of this invention in preventing overfitting and establishing the optimal practical model.
[0039] Specific implementation steps: (1) For the chlorogenic acid index, complete the spectral acquisition, segmentation and feature extraction according to the same steps as described above. After the base model competition and SVR meta-model fusion are completed, a candidate model array containing multiple algorithm combinations is generated.
[0040] (2) Activate the internal robust defense line (S8.1): Extract the RMSECV of all candidate models and sort them in ascending order. The system automatically locks the top-3 robust model set and strictly filters out overfitted models that have high test set scores on the surface but poor internal anti-perturbation ability.
[0041] (3) Start external practical verification (S8.2): Within the Top-3 robust set, one single base model ranks first on RMSECV but performs only moderately on the test set. In contrast, the fusion model "SG-1D+VIP+CARS+Dynamic Top-K Adaptive Fusion Model Based on Inflection Point Method" constructed according to this invention ranks second on internal error (RMSECV of 0.6607) but shows excellent generalization ability on the unseen test set, and its residual predictive deviation (RPD) of 3.5072 is the highest among all candidates.
[0042] (4) The system finally triggers the final output based on dual criteria, solidifying the fusion model into a dedicated program for chlorogenic acid detection. The coefficient of determination (R²P) of the measured prediction set is 0.9187, proving that the selection mechanism effectively takes into account both the algorithm fitting and industrial measurement needs.
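The dual-criterion selection in this example can be sketched as follows. Only the fusion model's RMSECV (0.6607) and RPD (3.5072) come from the text. Every other candidate name and number is hypothetical, and `select_final_model` is an illustrative helper, not a function from the source.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    rmsecv: float     # internal cross-validation error
    test_rpd: float   # RPD on the held-out test set

def select_final_model(candidates, top_n=3):
    """Keep the top_n candidates by RMSECV, then pick the best test-set RPD."""
    robust = sorted(candidates, key=lambda c: c.rmsecv)[:top_n]
    return max(robust, key=lambda c: c.test_rpd)

candidates = [
    Candidate("single base model", rmsecv=0.6400, test_rpd=2.90),   # hypothetical
    Candidate("fusion model", rmsecv=0.6607, test_rpd=3.5072),      # values from the text
    Candidate("third candidate", rmsecv=0.6900, test_rpd=3.10),     # hypothetical
    Candidate("overfitted model", rmsecv=0.9500, test_rpd=3.80),    # hypothetical
]
final = select_final_model(candidates)
print(final.name)
```

The overfitted candidate has the highest test-set RPD but never reaches the second stage, because its RMSECV places it outside the Top-3 robust set.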
[0043] Example 3: Parallel detection deployment of 6 core indicators of Lu Dang Shen oral liquid (overall effect verification).
[0044] Objective: To verify the universality and industrialization potential of the integrated intelligent architecture constructed in this invention in high-throughput, multi-indicator synchronous quality control.
[0045] Specific implementation steps: For six distinctive small molecule components in Codonopsis pilosula oral liquid (chlorogenic acid, syringin, genipin, codonopsis glycoside, codonopsis alcohol, and atractylodes lactone III), which vary in structure and concentration, the scheme of this invention was uniformly applied for parallel computation.
Verification results: The system automatically matched the corresponding feature selection track (VIP+CARS or pure CARS) for each of the six indicators, and the final regressor architectures all converged to the dynamic Top-K adaptive fusion model based on the inflection point method designed in this invention. The specific detection accuracy of each indicator is shown in Table 2.
Table 2
Conclusion: The external independent validation RPDs for the above six key indicators are all greater than 2.5 (five of them exceed the industry-recognized replacement threshold of 3.0). This demonstrates that the model architecture of this invention does not require low-level code reconstruction for different components. With a single near-infrared spectral scan, the content of six indicators can be detected quickly and simultaneously with high accuracy, improving on the traditional quality control paradigm of time-consuming and highly polluting chromatographic analysis and enabling industrial-grade application.
[0046] The above description is merely the basic principle and preferred embodiment of the present invention. Improvements and substitutions made by those skilled in the art based on the present invention are within the scope of protection of the present invention.
Claims
1. A method for detecting the content of quality control indicators in a traditional Chinese medicine oral liquid, characterized in that it includes the following steps:
S01. Collect samples of the traditional Chinese medicine oral liquid to be tested, prepare reference samples and determine the true value Y of their quality control index content, and collect the near-infrared spectra of all samples to form a spectral matrix X.
S02. Data preprocessing: divide the spectral data and true content values of all samples into a training set and a test set, and construct a quality control index content detection model based on the training and test sets, including:
S21. Perform variance filtering and preprocessing on the spectral data in the training set. Then use an asymmetric cascade optimization mechanism based on global response anchoring and dynamic elimination to extract the core feature subset from the spectral data.
S22. Based on the extracted core feature subset, establish a base model pool and obtain the out-of-fold (OOF) prediction matrix of each base model through 5-fold cross-validation.
S23. Extract the root mean square error of cross-validation (RMSECV) of every base model and sort the values in ascending order. Generate the first-order difference sequence of adjacent model errors, and take the largest positive jump in that sequence as the performance mutation inflection point. Remove the base models after the inflection point to form a dynamic Top-K high-quality subset.
S24. Concatenate the OOF prediction matrices of the Top-K high-quality subset column by column to form the Meta-Train matrix. Train a support vector regressor (SVR) with a hybrid kernel function on the Meta-Train matrix. The hybrid kernel function is a linear superposition of the linear kernel and the radial basis function (RBF) kernel, combined through adaptive weights at the performance inflection point.
S25. Construct a quality control index content detection model consisting of asymmetric cascade optimization, dynamic Top-K truncation, and the SVR meta-learner, and adjust the model based on the test set.
S03. Detect the content of the quality control indicators of the traditional Chinese medicine oral liquid using the quality control index content detection model.
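Steps S23 and S24 can be illustrated with a minimal sketch. It assumes the cross-validation errors arrive as a name-to-RMSECV mapping, and it treats the kernel weight `w` and the RBF width `gamma` as plain parameters, because the claim does not state how the adaptive weights are derived. The function names are hypothetical.

```python
import numpy as np

def top_k_by_inflection(rmsecv):
    """Keep the base models before the largest jump in sorted RMSECV.

    rmsecv: dict mapping a base-model name to its cross-validation RMSE.
    """
    names = sorted(rmsecv, key=rmsecv.get)           # ascending error
    if len(names) < 2:
        return names
    errors = np.array([rmsecv[n] for n in names])
    jumps = np.diff(errors)                          # first-order differences
    if jumps.max() <= 0:                             # no positive jump: keep all
        return names
    k = int(np.argmax(jumps)) + 1                    # models up to the jump
    return names[:k]

def hybrid_kernel(A, B, w=0.5, gamma=1.0):
    """Linear superposition w * linear + (1 - w) * RBF.

    w and gamma are illustrative parameters, not values from the source.
    """
    A = np.atleast_2d(np.asarray(A, dtype=float))
    B = np.atleast_2d(np.asarray(B, dtype=float))
    linear = A @ B.T
    sq_dist = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * linear
    rbf = np.exp(-gamma * np.maximum(sq_dist, 0.0))
    return w * linear + (1.0 - w) * rbf
```

Because scikit-learn's `SVR` accepts a callable kernel, a function of this shape could be passed as `kernel=functools.partial(hybrid_kernel, w=0.3, gamma=0.1)`, with `w` then tuned like any other hyperparameter. That is one possible design, not necessarily the one used in the invention.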
2. The method according to claim 1, characterized in that: In step S02, the spectral data and true concentration values of all samples are divided into a training set and a test set based on the SPXY algorithm. The spectral data and true concentration values are first normalized. The Euclidean distance between every pair of samples is then calculated in both the spectral space and the concentration space, and the two distances are combined into a normalized comprehensive distance. The two samples with the largest comprehensive distance are selected first and placed in the training set. The remaining samples are then searched iteratively: in each round, the sample whose minimum distance to the already selected training samples is largest is added to the training set, until a predetermined number is reached. The samples left over form the test set.
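The SPXY partition of claim 2 can be sketched as follows. This is a minimal illustration under stated assumptions: the normalization is taken to be column-wise standardization, the comprehensive distance is the sum of the two distance matrices each divided by its maximum, and `spxy_split` is a hypothetical name.

```python
import numpy as np

def _pairwise_dist(A):
    """Euclidean distance between every pair of rows."""
    sq = (A ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (A @ A.T)
    return np.sqrt(np.maximum(d2, 0.0))

def _standardize(A):
    std = A.std(axis=0)
    std[std == 0] = 1.0                      # leave constant columns unscaled
    return (A - A.mean(axis=0)) / std

def spxy_split(X, y, n_train):
    """Return (train_idx, test_idx) chosen by max-min joint x-y distance."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).reshape(len(X), -1)
    if not 2 <= n_train <= len(X):
        raise ValueError("n_train must be between 2 and the sample count")
    dx = _pairwise_dist(_standardize(X))
    dy = _pairwise_dist(_standardize(y))
    d = dx / dx.max() + dy / dy.max()        # normalized comprehensive distance
    i, j = np.unravel_index(np.argmax(d), d.shape)
    selected = [int(i), int(j)]              # the two most distant samples
    remaining = [k for k in range(len(X)) if k not in selected]
    while len(selected) < n_train:
        # distance from each candidate to its nearest selected sample
        min_d = d[np.ix_(remaining, selected)].min(axis=1)
        pick = remaining[int(np.argmax(min_d))]
        selected.append(pick)
        remaining.remove(pick)
    return selected, remaining
```

Selecting on the joint distance spreads the training set over both the spectral and the concentration range, which is the stated purpose of using SPXY rather than a random split.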
3. The method according to claim 1, characterized in that: The variance filtering in step S21 performs a global variance analysis on the original spectral matrix of the training set, calculates the variance of each band, and removes bands whose variance is below the threshold. The preprocessing options in step S21 include no processing, standard normal variate (SNV) transformation, multiplicative scatter correction (MSC), Savitzky-Golay smoothing, and derivative processing.
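Two of the operations named in claim 3 are short enough to sketch directly. The example below assumes spectra are stored one sample per row; the threshold value is left to the caller because the source does not state it, and the function names are hypothetical.

```python
import numpy as np

def variance_filter(X_train, threshold):
    """Drop bands (columns) whose training-set variance is below threshold.

    Returns the filtered matrix and the boolean mask, so the same bands
    can later be removed from the test spectra.
    """
    X_train = np.asarray(X_train, dtype=float)
    keep = X_train.var(axis=0) >= threshold
    return X_train[:, keep], keep

def snv(X):
    """Standard normal variate: center and scale each spectrum (row)."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=1, keepdims=True)
    std = X.std(axis=1, ddof=1, keepdims=True)
    return (X - mean) / std
```

The mask is computed on the training set only and then reused, which avoids leaking test-set information into band selection.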
4. The method according to claim 1, characterized in that: The asymmetric cascade optimization mechanism described in step S21 is as follows: A basic PLS regression model is constructed on the full band matrix, and the variable importance in projection (VIP) score of each band variable with respect to the response is calculated. Bands whose VIP scores exceed the significance threshold are selected as the target retention variable set. This set is input into the CARS (competitive adaptive reweighted sampling) dynamic redundancy elimination model. The number of Monte Carlo iterations is set, and the proportion of variables retained in the i-th iteration is controlled dynamically by an exponential decay function. In each iteration a PLS model is built, and bands with small weights are eliminated according to the absolute values of the regression coefficients. Finally, the band set with the smallest RMSECV in 5-fold cross-validation is selected.
5. The method according to claim 4, characterized in that: If the number of bands with VIP scores exceeding the significance threshold does not meet the set requirement, an auxiliary retention mechanism is triggered and the 10 bands with the highest VIP scores are forcibly retained.
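The band retention rule of claims 4 and 5 and the CARS decay schedule can be sketched as below. Several values are assumptions: the source gives neither the VIP significance threshold nor the minimum band count, so `threshold=1.0` and `min_bands=10` are placeholders, and the decay function follows the form commonly used for CARS, r_i = a·exp(-k·i), with a and k chosen so that all variables are kept in the first iteration and two in the last. The function names are hypothetical.

```python
import numpy as np

def select_by_vip(vip, threshold=1.0, min_bands=10, fallback_top=10):
    """Indices of bands kept by the VIP track, with the fallback of claim 5.

    threshold and min_bands are placeholder values, not from the source.
    """
    vip = np.asarray(vip, dtype=float)
    selected = np.flatnonzero(vip > threshold)
    if selected.size < min_bands:
        # auxiliary retention: force-keep the highest-scoring bands
        top = min(fallback_top, vip.size)
        selected = np.sort(np.argsort(vip)[::-1][:top])
    return selected

def cars_retention_ratios(n_vars, n_iter):
    """Fraction of variables kept in iterations 1..n_iter (n_iter >= 2).

    Assumed schedule: r_i = a * exp(-k * i), r_1 = 1, r_n_iter = 2 / n_vars.
    """
    a = (n_vars / 2.0) ** (1.0 / (n_iter - 1))
    k = np.log(n_vars / 2.0) / (n_iter - 1)
    i = np.arange(1, n_iter + 1)
    return a * np.exp(-k * i)
```

With this schedule most bands are eliminated in the early iterations and the selection is refined slowly afterwards, which matches the "dynamic elimination" described in claim 4.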
6. The method according to claim 1, characterized in that: The base model pool includes partial least squares regression, support vector regression, k-nearest neighbor regression, Gaussian process regression, random forest, extreme gradient boosting (XGBoost), and gradient boosting decision tree. For each model in the base model pool, a unified 5-fold cross-validation combined with grid search is used to optimize the hyperparameters over multiple dimensions, with the goal of minimizing the root mean square error of cross-validation. For each base model at its optimal hyperparameters, the out-of-fold (OOF) prediction vector, the test set prediction vector, and the corresponding root mean square error are recorded and saved, and together they constitute the base model pool.
7. The method for detecting the content of quality control indicators in traditional Chinese medicine oral liquids according to claim 1, characterized in that: Before the Meta-Train matrix is input into the SVR meta-learner, it is standardized in every case to eliminate the distance metric bias caused by slight inconsistencies in the output scales of the different base models. During training of the SVR meta-learner, 5-fold cross-validation with grid search is used to jointly optimize the penalty coefficient and the kernel parameters.
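The meta-learner training of claim 7 maps naturally onto a scikit-learn pipeline. The sketch below is a simplified illustration: it uses a plain RBF kernel rather than the hybrid kernel of claim 1, the grid values are arbitrary placeholders, and `fit_meta_learner` is a hypothetical name.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

def fit_meta_learner(oof_columns, y, seed=0):
    """Standardize the Meta-Train matrix, then tune an SVR by 5-fold CV.

    oof_columns: list of 1-D OOF prediction vectors, one per Top-K model.
    """
    meta_train = np.column_stack(oof_columns)
    # Scaling lives inside the pipeline, so it is refit within each CV fold
    pipe = make_pipeline(StandardScaler(), SVR(kernel="rbf"))
    grid = {                                   # placeholder grid values
        "svr__C": [1.0, 10.0, 100.0],
        "svr__gamma": [0.01, 0.1, 1.0],
        "svr__epsilon": [0.01, 0.1],
    }
    folds = KFold(n_splits=5, shuffle=True, random_state=seed)
    search = GridSearchCV(pipe, grid, cv=folds,
                          scoring="neg_root_mean_squared_error")
    search.fit(meta_train, y)
    return search
```

The returned object exposes `best_params_` and can predict directly on a matrix of base-model outputs for new samples, with the standardization applied automatically.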
8. The method for detecting the content of quality control indicators in traditional Chinese medicine oral liquids according to claim 1, characterized in that: The oral liquid is Codonopsis pilosula oral liquid, and the quality control indicators are one or more of chlorogenic acid, syringin, genipin, codonopsis glycoside, codonopsis alcohol, and atractylodes lactone III.
9. The method for detecting the content of quality control indicators of traditional Chinese medicine oral liquid according to claim 1, characterized in that: In step S01, the true value Y of the quality control index content is obtained by ultra-high performance liquid chromatography.