Method for identifying years of citrus pericarp based on hierarchical multi-modal fusion

By integrating near-infrared spectroscopy, color, electronic nose, and electronic tongue features through a hierarchical multimodal fusion network, the method addresses the problem of identifying the age of dried tangerine peel, achieving high-precision age identification and quality control and providing a standardized identification method.

CN122196753A · Pending · Publication Date: 2026-06-12 · JIANGXI NORMAL UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
JIANGXI NORMAL UNIV
Filing Date
2026-05-15
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

Existing technologies cannot fully capture the multi-dimensional information of the aging process of dried tangerine peel, making it difficult to distinguish subtle aging differences and deal with complex adulteration situations, and lacking standardized identification methods.

Method used

A hierarchical multimodal fusion method is adopted, which combines near-infrared spectroscopy, color, electronic nose and electronic tongue features. Feature extraction and fusion are performed by constructing a hierarchical multimodal fusion network. A one-dimensional ResNet backbone network and a fully connected network are used for encoding and mapping, and residual feature fusion is performed. Finally, a shallow multilayer perceptron is used for classification.

Benefits of technology

It achieves high-precision identification of the age of dried tangerine peel, solves the problems of strong subjectivity and difficulty in standardization, provides an objective and quantitative identification method, and improves the standardization and anti-counterfeiting capabilities of dried tangerine peel quality control.



Abstract

The application discloses a method for identifying the age of dried tangerine peel based on hierarchical multimodal fusion, and relates to the technical field of dried tangerine peel quality identification. The method comprises the following steps: step 1, collecting multimodal data of each dried tangerine peel sample, including near-infrared spectrum, color, electronic nose and electronic tongue features, and standardizing all the features; and step 2, constructing a hierarchical multimodal fusion network, performing backbone feature extraction and dual-path projection on the main modality, and performing encoding and mapping on the auxiliary modalities. Compared with the prior art, the method has the following beneficial effects: it effectively integrates multi-source heterogeneous information and realizes high-precision identification of the aging years of dried tangerine peel; it adopts a two-stage hierarchical fusion strategy that retains the key chemical information unique to the near-infrared modality through residual refinement while capturing cross-modal complementary information; and it solves the problems that traditional identification methods are highly subjective, inefficient and difficult to standardize.

Description

Technical Field

[0001] This invention relates to the field of tangerine peel quality identification technology, specifically a method for identifying the age of tangerine peel based on hierarchical multimodal fusion.

Background Technology

[0002] The aging process of dried tangerine peel involves complex evolution of its physicochemical properties, including coupled changes in chemical components (such as flavonoids and volatile oils), macroscopic color, microscopic aroma components, and taste characteristics. These changes collectively determine the pharmacological efficacy and market value of dried tangerine peel. Due to the complexity and multidimensionality of the aging process, no single-modal technology can comprehensively capture all relevant aging information, resulting in performance bottlenecks when distinguishing subtle aging differences or dealing with complex adulteration situations. Therefore, a multimodal acquisition method integrating color, aroma, taste, and spectral information is needed to more comprehensively characterize the aging state.

Summary of the Invention

[0003] This invention provides a method for identifying the age of dried tangerine peel based on hierarchical multimodal fusion, in order to solve the problems mentioned in the background art.

[0004] According to an embodiment of the present invention, a method for identifying the age of dried tangerine peel based on hierarchical multimodal fusion is provided, comprising the following steps:
Step 1: Collect multimodal data for each tangerine peel sample, including near-infrared spectrum, color, electronic nose and electronic tongue features, and standardize all features, with the near-infrared spectrum as the main modality and color, electronic nose and electronic tongue as auxiliary modalities.
Step 2: Construct a hierarchical multimodal fusion network, extract the backbone features of the main modality, perform dual-path projection on the main-modality features to generate near-infrared spectral fusion features and near-infrared spectral refinement features, and encode and map the auxiliary modalities.
Step 3: Perform the first stage of multimodal fusion by splicing the near-infrared spectral fusion features with the three auxiliary-modality features. The spliced features are passed through a fusion block to generate a preliminary fused representation.
Step 4: Perform the second stage of main-modality information refinement. The preliminary fused representation and the near-infrared spectral refinement features are combined by residual feature fusion to obtain the final fused feature representation, which is then used for final classification.
Step 5: Evaluate the model performance.

[0005] As a further aspect of the present invention: in step 1, the multimodal data includes: near-infrared (NIR) spectral data, with the spectra stitched together to form a 250-dimensional vector $\mathbf{x}_{NIR} \in \mathbb{R}^{250}$; color feature data, a 6-dimensional vector $\mathbf{x}_{C} \in \mathbb{R}^{6}$; electronic nose feature data, a 10-dimensional vector $\mathbf{x}_{EN} \in \mathbb{R}^{10}$; and electronic tongue feature data, an 8-dimensional vector $\mathbf{x}_{ET} \in \mathbb{R}^{8}$.

[0006] As a further aspect of the present invention: In step 2, the backbone feature extraction adopts a one-dimensional ResNet backbone network, the encoding mapping adopts a fully connected network, and the encoding mapping is completed by a fully connected encoder and mapped to a shared latent space. The dual-path projection generates fusion features for the first stage of multimodal fusion and refinement features for the second stage of main modality information refinement through two independent fully connected layers in the fully connected network.

[0007] As a further aspect of the present invention: in step 3, the splicing is completed in the latent space, and the fusion block is a fusion layer with batch normalization, activation function and Dropout, generating a preliminary fusion representation.

[0008] As a further aspect of the present invention: in step 4, residual feature fusion is performed by adding the preliminary fused representation to the near-infrared spectral refinement features, retaining the age information dominated by the main modality, and obtaining the final fused feature representation.

[0009] As a further aspect of the present invention: in step 4, the final classification adopts a shallow multilayer perceptron (MLP) classifier to output the category probability of the year of the tangerine peel.

[0010] As a further aspect of the present invention: in step 5, the indicators used to evaluate the model performance include overall accuracy, the precision, recall and F1-score of each class k, the macro-average metrics, and Cohen's Kappa coefficient.

[0011] Compared with existing technologies, the beneficial effects of this invention are as follows: This invention's method for identifying the age of tangerine peel based on hierarchical multimodal fusion effectively integrates multi-source heterogeneous information such as near-infrared spectroscopy, color, electronic nose, and electronic tongue through a unique hierarchical multimodal fusion architecture, achieving high-precision identification of the aging age of tangerine peel. Employing a two-stage hierarchical fusion strategy, it captures complementary cross-modal information while effectively preserving key chemical information unique to the near-infrared mode through residual refinement, achieving optimal information utilization. This transforms the traditional aging age identification of tangerine peel, which relies on expert experience, into an objective and quantitative assessment based on instrument measurement and artificial intelligence algorithms. It effectively solves the problems of strong subjectivity and difficulty in standardization, providing standardized, anti-counterfeiting, and data-driven technical support for the quality control of tangerine peel and other high-value Chinese herbal medicines.

Attached Figure Description

[0012] To more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the embodiments of the present invention will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0013] Figure 1 is a schematic diagram of the t-SNE projections of the single-modal features and of their simple concatenation, provided in an embodiment of the present invention.

[0014] Figure 2 is a schematic diagram of the confusion matrices provided in an embodiment of the present invention.

Detailed Implementation

[0015] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0016] A method for identifying the age of dried tangerine peel based on hierarchical multimodal fusion includes the following steps:
Step 1: Collect multimodal data for each tangerine peel sample, including near-infrared (NIR) spectrum, color, electronic nose and electronic tongue features, and standardize all features, with the near-infrared spectrum as the main modality and color, electronic nose and electronic tongue as auxiliary modalities.
Step 2: Construct a hierarchical multimodal fusion network, extract the backbone features of the main modality, perform dual-path projection on the main-modality features to generate near-infrared spectral fusion features and near-infrared spectral refinement features, and encode and map the auxiliary modalities.
Step 3: Perform the first stage of multimodal fusion by splicing the near-infrared spectral fusion features with the three auxiliary-modality features. The spliced features are passed through a fusion block to generate a preliminary fused representation.
Step 4: Perform the second stage of main-modality information refinement. The preliminary fused representation and the near-infrared spectral refinement features are combined by residual feature fusion to obtain the final fused feature representation, which is then used for final classification.
Step 5: Evaluate the model performance.

[0017] In this embodiment, in step 1, the multimodal data includes: near-infrared spectral data, with the spectra stitched together to form a 250-dimensional vector $\mathbf{x}_{NIR} \in \mathbb{R}^{250}$; color feature data, a 6-dimensional vector $\mathbf{x}_{C} \in \mathbb{R}^{6}$; electronic nose feature data, a 10-dimensional vector $\mathbf{x}_{EN} \in \mathbb{R}^{10}$; and electronic tongue feature data, an 8-dimensional vector $\mathbf{x}_{ET} \in \mathbb{R}^{8}$.

[0018] In a specific embodiment, near-infrared spectral data acquisition employs a near-infrared spectrometer, such as the American VIAVI 1700 model, to perform non-destructive spectral scanning of the tangerine peel sample within a specific wavelength range (e.g., 908.1-1676.2 nm), acquiring the near-infrared spectra of the outer peel and the inner lining separately. The two spectra are then stitched together to form a 250-dimensional feature vector.
Color feature data acquisition uses a high-precision colorimeter, such as the Japanese Konica Minolta CR-400 model, to quantify the surface color of the sample in the international standard CIE L*a*b* color space. Multiple points on both the outer peel and the inner lining are measured, and the measurements are averaged to obtain representative L*, a* and b* values.
Electronic nose feature data acquisition uses an electronic nose system equipped with multiple types of metal oxide semiconductor (MOS) sensors, such as the German Airsense Analytics GmbH PEN3 model, to simulate olfactory perception. The sample is placed in a sealed headspace vial and incubated at a constant temperature so that the volatile organic compounds reach gas-phase equilibrium. The headspace gas is then introduced into the sensor array, and the dynamic response curve is recorded to obtain 10-dimensional aroma features.
Electronic tongue feature data acquisition uses a multi-probe electronic tongue system, such as the Japanese Insent SA402B Plus-EX model, to simulate taste perception and quantify the water-soluble taste components of the sample. After the tangerine peel extract has cooled to room temperature, multi-stage measurements are performed to obtain 8-dimensional taste attribute data such as acidity, bitterness, astringency, and umami.
All features are standardized so that each feature has a mean of 0 and a variance of 1, in order to eliminate differences in sensor scale and improve the stability of model training.
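The standardization step can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names are hypothetical, the data are random stand-ins, and fitting the statistics on the training rows only is an assumption (the source says only that every feature ends up with mean 0 and variance 1).

```python
import numpy as np

def fit_standardizer(train):
    """Return the per-feature mean and standard deviation of the training rows."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std[std == 0.0] = 1.0  # guard constant features against division by zero
    return mean, std

def standardize(x, mean, std):
    """Apply z-score scaling with previously fitted statistics."""
    return (x - mean) / std

rng = np.random.default_rng(0)
# Hypothetical stand-in for 128 training samples of 10-dimensional e-nose features.
train_nose = rng.normal(loc=5.0, scale=3.0, size=(128, 10))
mean, std = fit_standardizer(train_nose)
train_scaled = standardize(train_nose, mean, std)
```

The same `mean` and `std` would then be reused for validation and test samples, so that no information from those subsets leaks into the scaling.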

[0019] In this embodiment, in step 2, the backbone feature extraction adopts a one-dimensional ResNet backbone network, the encoding mapping adopts a fully connected network, and the encoding mapping is completed by a fully connected encoder and mapped to the shared latent space. The dual-path projection generates near-infrared spectral fusion features and near-infrared spectral refinement features through two independent fully connected layers in the fully connected network, which are used for the first-stage multimodal fusion and the second-stage main-modality information refinement, respectively.

[0020] In a specific embodiment, the high-level features of the NIR spectrum are extracted by the one-dimensional ResNet backbone network, which outputs the NIR latent feature vector as follows:
$$\mathbf{h}_{NIR} = f_{NIR}(\mathbf{x}_{NIR}; \theta_{NIR}) \in \mathbb{R}^{d} \tag{1}$$
In the formula, $f_{NIR}$ denotes the NIR encoder, $\theta_{NIR}$ its trainable parameters, $\mathbf{h}_{NIR}$ the NIR latent feature vector, $d$ the dimension of the shared latent space, and $\mathbf{x}_{NIR}$ the near-infrared spectral data of each tangerine peel sample.
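To illustrate what a one-dimensional residual block does to a spectrum, here is a deliberately small sketch. The source does not give the backbone's depth, channel count, or kernel size, so everything below (a single channel, kernel size 3, no normalization, the names `conv1d_same` and `residual_block`) is an assumption for illustration only.

```python
import numpy as np

def conv1d_same(x, kernel):
    """Single-channel 1D convolution that keeps the input length."""
    return np.convolve(x, kernel, mode="same")

def residual_block(x, k1, k2):
    """y = ReLU(x + conv(ReLU(conv(x)))): the skip connection adds the input back."""
    h = np.maximum(conv1d_same(x, k1), 0.0)
    h = conv1d_same(h, k2)
    return np.maximum(x + h, 0.0)

rng = np.random.default_rng(0)
spectrum = rng.normal(size=250)       # stand-in for one standardized NIR vector
k1 = rng.normal(scale=0.1, size=3)    # hypothetical learned kernels
k2 = rng.normal(scale=0.1, size=3)
out = residual_block(spectrum, k1, k2)
```

Because of the skip connection, the block reduces to a plain ReLU of its input when both kernels are zero, which is why residual stacks are comparatively easy to train.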

[0021] Alongside the backbone feature extraction and dual-path projection of the main modality, the auxiliary-modality information is encoded. The embedding vector of modality $m$ in the shared latent space is obtained as follows:
$$\mathbf{z}_{m} = \mathrm{Dropout}_{p}\big(\sigma(\mathbf{W}_{m}\mathbf{x}_{m} + \mathbf{b}_{m})\big), \quad m \in \{C, EN, ET\} \tag{2}$$
In the formula, $\mathbf{W}_{m}$ and $\mathbf{b}_{m}$ are the learnable weight matrix and bias of modality $m$, $\sigma$ is the ReLU activation function, $\mathrm{Dropout}_{p}$ is a random deactivation (dropout) operation with a fixed probability $p$, $\mathbf{z}_{m}$ is the embedding vector of modality $m$ in the shared latent space, $\mathbf{x}_{m}$ is the original input feature vector of that modality, and $m$ denotes the auxiliary modality type, where $C$, $EN$ and $ET$ represent color, electronic nose, and electronic tongue, respectively.

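[0022] The near-infrared spectral fusion features and near-infrared spectral refinement features are respectively represented as follows:
$$\mathbf{z}_{NIR}^{fuse} = \mathbf{W}_{f}\,\mathbf{h}_{NIR} \tag{3}$$
$$\mathbf{z}_{NIR}^{ref} = \mathrm{Dropout}_{p}\big(\mathbf{W}_{r}\,\mathbf{h}_{NIR}\big) \tag{4}$$
In the formula, $\mathbf{W}_{f}$ and $\mathbf{W}_{r}$ are learnable parameters, $\mathbf{z}_{NIR}^{fuse}$ is the fusion feature used for the first stage of multimodal fusion, and $\mathbf{z}_{NIR}^{ref}$ is the refinement feature used for the second-stage main-modality information refinement, obtained after random deactivation (dropout) regularization.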

[0023] In this embodiment, in step 3, the splicing is completed in the latent space, and the fusion block is a fusion layer with batch normalization, activation function and Dropout, generating a preliminary fusion representation.

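[0024] In a specific embodiment, the concatenated joint feature vector is represented as follows:
$$\mathbf{z}_{cat} = \big[\mathbf{z}_{NIR}^{fuse} \,\Vert\, \mathbf{z}_{C} \,\Vert\, \mathbf{z}_{EN} \,\Vert\, \mathbf{z}_{ET}\big] \tag{5}$$
In the formula, $\Vert$ represents vector concatenation, $\mathbf{z}_{cat}$ is the concatenated joint feature vector, and $\mathbf{z}_{C}$, $\mathbf{z}_{EN}$ and $\mathbf{z}_{ET}$ are the embedding vectors of the three auxiliary modalities (color, electronic nose and electronic tongue) in the shared latent space, obtained by formula (2).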

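[0025] Feature representation after initial fusion:
$$\mathbf{z}_{fusion} = \mathrm{Dropout}_{p}\Big(\sigma\big(\mathrm{BN}(\mathbf{W}_{1}\mathbf{z}_{cat} + \mathbf{b}_{1})\big)\Big) \tag{6}$$
In the formula, $\mathrm{BN}$ indicates a batch normalization operation, $\mathbf{W}_{1}$ and $\mathbf{b}_{1}$ are the weight matrix and bias of the first-stage fusion fully connected layer, and $\mathbf{z}_{fusion}$ represents the features after initial fusion.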

[0026] In this embodiment, in step 4, residual feature fusion adds the preliminary fused representation to the near-infrared spectral refinement features, retaining the age information dominated by the main modality, to obtain the final fused feature representation.

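[0027] In a specific embodiment, the final fused feature representation is as follows:
$$\mathbf{z}_{final} = \mathbf{z}_{fusion} + \mathbf{z}_{NIR}^{ref} \tag{7}$$
In the formula, $\mathbf{z}_{final}$ is the final fusion representation, which captures both cross-modal similarity and NIR-specific age trends.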
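The two-stage fusion described in steps 2 to 4 can be sketched as a single inference-time forward pass. This is a rough sketch under stated assumptions, not the patented network: the latent dimension `d = 32`, the random weights, the presence of biases in the projections, the layer order inside the fusion block, and all names (`linear`, `forward`, `params`) are hypothetical. Dropout is omitted because it acts as the identity at inference, and batch normalization uses fixed running statistics.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32  # assumed shared latent dimension (not given in the source)
relu = lambda v: np.maximum(v, 0.0)

def linear(in_dim, out_dim):
    """Randomly initialised weight matrix and zero bias for a fully connected layer."""
    return rng.normal(scale=0.1, size=(out_dim, in_dim)), np.zeros(out_dim)

def forward(h_nir, x_aux, params):
    # Auxiliary encoders: fully connected layer + ReLU into the shared latent space.
    z_aux = [relu(W @ x + b) for (W, b), x in zip(params["aux"], x_aux)]
    # Dual-path projection of the NIR latent vector (two independent layers).
    Wf, bf = params["fuse_proj"]
    Wr, br = params["refine_proj"]
    z_fuse = Wf @ h_nir + bf
    z_ref = Wr @ h_nir + br
    # Stage 1: concatenate in latent space, then linear -> batch norm -> ReLU.
    z_cat = np.concatenate([z_fuse] + z_aux)
    W1, b1 = params["fusion"]
    mu, var = params["bn"]  # running mean and variance of batch normalisation
    z_fusion = relu((W1 @ z_cat + b1 - mu) / np.sqrt(var + 1e-5))
    # Stage 2: residual refinement adds the NIR-only path back in.
    return z_fusion + z_ref

params = {
    "aux": [linear(6, d), linear(10, d), linear(8, d)],  # color, e-nose, e-tongue
    "fuse_proj": linear(d, d),
    "refine_proj": linear(d, d),
    "fusion": linear(4 * d, d),
    "bn": (np.zeros(d), np.ones(d)),
}
h_nir = rng.normal(size=d)  # stand-in for the 1D ResNet backbone output
x_aux = [rng.normal(size=6), rng.normal(size=10), rng.normal(size=8)]
z_final = forward(h_nir, x_aux, params)
```

The design point the sketch makes visible is that the refinement path bypasses the fusion block entirely, so NIR information reaches the classifier even if the fused representation is dominated by the auxiliary modalities.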

[0028] In this embodiment, in step 4, the final classification uses a shallow multilayer perceptron (MLP) classifier to output the category probability of the year of the tangerine peel.

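[0029] In a specific embodiment, the output category probability is represented as follows:
$$o_{i,k} = \big(\mathbf{W}_{c}\,\mathbf{z}_{i} + \mathbf{b}_{c}\big)_{k} \tag{8}$$
$$p_{i,k} = \frac{\exp(o_{i,k})}{\sum_{j=1}^{K}\exp(o_{i,j})} \tag{9}$$
In the formula, $\mathbf{z}_{i}$ is the fused feature vector of the $i$-th sample, $\mathbf{W}_{c}$ and $\mathbf{b}_{c}$ are the parameters of the final linear layer, $o_{i,k}$ is the logit value of sample $i$ for category $k$, $p_{i,k}$ is the predicted probability of sample $i$ for category $k$, and $k$ indicates the year category of the dried tangerine peel.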
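Turning logits into category probabilities is a softmax. The sketch below uses made-up logits for one sample and the five year categories mentioned in the verification section; subtracting the row maximum before exponentiating is a standard numerical-stability choice, not something the source specifies.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

YEARS = [3, 5, 10, 15, 20]                         # year categories from the embodiment
logits = np.array([[2.0, 0.5, -1.0, 0.0, -0.5]])   # hypothetical logits for one sample
probs = softmax(logits)
predicted_year = YEARS[int(probs.argmax(axis=-1)[0])]
```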

[0030] Optimization is achieved through model training.

[0031] Model training representation:
$$\mathcal{L} = -\frac{1}{B}\sum_{i=1}^{B}\log p_{i,y_{i}} \tag{10}$$
In the formula, $B$ is the batch size, $y_{i}$ is the true category label of sample $i$, and $p_{i,y_{i}}$ represents the model's predicted probability for the true class.
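The training objective described here, the batch mean of the negative log-probability assigned to the true class, is the usual cross-entropy loss. A minimal sketch follows; the small constant added inside the logarithm is an assumption to avoid `log(0)`, and the uniform predictions are a hypothetical sanity check.

```python
import numpy as np

def cross_entropy(probs, labels):
    """Mean negative log-probability of the true class over a batch."""
    batch = np.arange(len(labels))
    return float(-np.log(probs[batch, labels] + 1e-12).mean())

# Hypothetical batch of 4 samples with uniform predictions over 5 classes.
uniform = np.full((4, 5), 0.2)
labels = np.array([0, 1, 2, 4])
loss_uniform = cross_entropy(uniform, labels)  # equals log(5) for uniform predictions
```

A model that has learned nothing starts near `log(5)` on a five-class problem, which is a convenient reference point when monitoring the training and validation curves.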

[0032] In this embodiment, the metrics used to evaluate model performance in step 5 include overall accuracy, the precision, recall and F1-score of each class k, the macro-average metrics, and Cohen's Kappa coefficient.

[0033] The metrics used can reflect the model's classification ability and robustness from different perspectives.

[0034] In a specific embodiment, the overall accuracy is expressed as:
$$\mathrm{Accuracy} = \frac{1}{N}\sum_{i=1}^{N}\mathbb{1}(\hat{y}_{i} = y_{i}) \tag{11}$$
In the formula, $\mathbb{1}(\cdot)$ is the indicator function, which equals 1 if its argument is true (i.e., the predicted category matches the true category) and 0 otherwise (i.e., the predicted category is inconsistent with the true category); $\hat{y}_{i}$ is the predicted category of sample $i$, $y_{i}$ is the corresponding true category, and $N$ is the total number of samples.

[0035] Precision and recall for category k:
$$\mathrm{Precision}_{k} = \frac{TP_{k}}{TP_{k} + FP_{k}} \tag{12}$$
$$\mathrm{Recall}_{k} = \frac{TP_{k}}{TP_{k} + FN_{k}} \tag{13}$$
In the formula, $TP_{k}$ is the number of samples that are correctly predicted as category k, $FP_{k}$ is the number of samples that were incorrectly predicted as category k but belong to other categories, and $FN_{k}$ is the number of samples that belong to category k but are incorrectly classified into other categories.

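[0036] The F1 score of category k:
$$F1_{k} = \frac{2\,\mathrm{Precision}_{k}\,\mathrm{Recall}_{k}}{\mathrm{Precision}_{k} + \mathrm{Recall}_{k}} \tag{14}$$
In the formula, $F1_{k}$ is the F1 score of the k-th category, $\mathrm{Precision}_{k}$ is the precision of the k-th category, and $\mathrm{Recall}_{k}$ is the recall of the k-th category. The F1 score summarizes the trade-off between precision and recall for category k.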

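[0037] The macro-average metrics are obtained by averaging across all K categories and include macro-average precision, macro-average recall, and macro-average F1-score:
$$\mathrm{Precision}_{macro} = \frac{1}{K}\sum_{k=1}^{K}\mathrm{Precision}_{k} \tag{15}$$
$$\mathrm{Recall}_{macro} = \frac{1}{K}\sum_{k=1}^{K}\mathrm{Recall}_{k} \tag{16}$$
$$F1_{macro} = \frac{1}{K}\sum_{k=1}^{K}F1_{k} \tag{17}$$
In the formula, $\mathrm{Precision}_{macro}$ is the macro-average precision, $\mathrm{Recall}_{macro}$ is the macro-average recall, $F1_{macro}$ is the macro-average F1 score, $K$ is the total number of categories, and $k$ indexes the k-th category.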
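All of the per-class and macro-average metrics can be read off a confusion matrix. The sketch below assumes the convention that rows are true classes and columns are predicted classes, and it uses a hypothetical matrix (8 test samples per class, one misclassification); it is not the confusion matrix from Figure 2.

```python
import numpy as np

def per_class_metrics(cm):
    """cm[i, j] = number of samples with true class i predicted as class j."""
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp          # predicted as k but belonging elsewhere
    fn = cm.sum(axis=1) - tp          # belonging to k but predicted elsewhere
    precision = tp / np.maximum(tp + fp, 1)
    recall = tp / np.maximum(tp + fn, 1)
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    return precision, recall, f1

# Hypothetical 5-class result: one sample of class 1 is predicted as class 2.
cm = np.diag([8, 8, 8, 8, 8])
cm[1, 1] = 7
cm[1, 2] = 1

precision, recall, f1 = per_class_metrics(cm)
accuracy = np.trace(cm) / cm.sum()    # 39 of 40 correct
macro_precision, macro_recall, macro_f1 = precision.mean(), recall.mean(), f1.mean()
```

In this example the single error lowers recall for class 1 (7/8) and precision for class 2 (8/9), while every other per-class value stays at 1, which shows how the per-class view localizes an error that overall accuracy only counts.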

[0038] Cohen's Kappa coefficient is expressed as follows:
$$\kappa = \frac{p_{o} - p_{e}}{1 - p_{e}} \tag{18}$$
In the formula, $p_{o}$ is the observed agreement, which is equal to the overall accuracy, and $p_{e}$ is the chance agreement, which is estimated from the marginal distributions of the confusion matrix.
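As a worked illustration of Cohen's Kappa, the sketch below computes the observed agreement from the diagonal of a confusion matrix and the chance agreement from its row and column totals. The matrix is hypothetical (five balanced classes of 8 samples with a single error) and the function name is illustrative.

```python
import numpy as np

def cohen_kappa(cm):
    """Cohen's Kappa from a confusion matrix (rows: true class, columns: predicted)."""
    n = cm.sum()
    p_o = np.trace(cm) / n                                   # observed agreement
    p_e = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n ** 2   # chance agreement
    return (p_o - p_e) / (1 - p_e)

cm = np.diag([8, 8, 8, 8, 8])
cm[1, 1] = 7
cm[1, 2] = 1          # one class-1 sample predicted as class 2
kappa = cohen_kappa(cm)   # p_o = 0.975, p_e = 0.2, kappa = 0.96875
```

With five balanced classes the chance agreement is 0.2, so Kappa rescales accuracy from the interval [0.2, 1] to [0, 1].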

[0039] Actual verification: Figure 1 is a schematic diagram of the t-SNE projections of the single-modal features (near-infrared, electronic tongue, electronic nose and color features) and of their simple concatenation. The t-SNE projection gives an intuitive view of how separable the different aging categories are in the feature space. The tangerine peel samples were all collected from the same place of origin and cover five aging stages: 3, 5, 10, 15 and 20 years. Each stage contains 40 complete pieces of aged tangerine peel, for 200 pieces in total. To keep the distribution of aging years consistent across subsets, the entire multimodal dataset was randomly partitioned using stratified sampling into a 64% training set (128 samples), a 16% validation set (32 samples), and a 20% independent test set (40 samples). The training set was used to fit model parameters; the validation set was used to tune hyperparameters and monitor overfitting; and the independent test set was used to provide an unbiased estimate of the model's final predictive performance.
Multimodal data, including near-infrared spectrum, color, electronic nose and electronic tongue features, were collected for each sample, and all features were standardized. The hierarchical multimodal fusion network was then constructed. Formula (1) outputs the NIR latent feature vector. According to formula (2), the color, electronic nose and electronic tongue features are encoded as auxiliary-modality information and mapped to embedding vectors in the shared latent space. According to formulas (3) and (4), the main-modality features are projected along two paths, generating the fusion features for the first stage of multimodal fusion and, after random deactivation (dropout) regularization, the refinement features for the second stage of main-modality information refinement. In the first stage, formula (5) splices the NIR fusion features with the color, electronic nose and electronic tongue features in the latent space, and formula (6) passes the spliced features through the fusion block to generate the preliminary fused representation. In the second stage, formula (7) adds the first-stage fused features to the NIR refinement features, retaining the NIR-dominated age information, to obtain the final fused features. These are input into the classifier according to formulas (8) and (9), which outputs the probabilities of the five year categories.
With this two-stage hierarchical fusion strategy, the network integrates multi-source heterogeneous information from near-infrared spectrum, color, electronic nose and electronic tongue: it captures cross-modal complementary information while retaining, through residual refinement, the key chemical information unique to the near-infrared modality. The model is trained and optimized according to formula (10), and its performance is evaluated according to formulas (11) to (18), which give the overall accuracy, the precision, recall and F1-score of each class k, the macro-average metrics, and Cohen's Kappa coefficient.
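The 128/32/40 partition can be reproduced with a small stratified sampler. The source does not say how the split was drawn, so the two-pass procedure below (test set first, then validation from the remainder), the helper name `take_stratified`, and the seed are assumptions. Note that 32 and 128 are not divisible by 5, so per-class counts in the validation and training sets necessarily differ by one between classes.

```python
import random
from collections import defaultdict

def take_stratified(pool, labels, n_take, rng):
    """Draw n_take indices from pool, spread as evenly as possible over classes.

    Assumes every class has enough samples left in the pool.
    Returns (taken, remaining).
    """
    by_class = defaultdict(list)
    for idx in pool:
        by_class[labels[idx]].append(idx)
    classes = sorted(by_class)
    base, extra = divmod(n_take, len(classes))
    lucky = set(rng.sample(classes, extra))  # classes that contribute one extra sample
    taken = []
    for c in classes:
        rng.shuffle(by_class[c])
        taken += by_class[c][: base + (c in lucky)]
    taken_set = set(taken)
    return taken, [i for i in pool if i not in taken_set]

# 40 samples for each of the five aging stages, 200 in total.
labels = [year for year in (3, 5, 10, 15, 20) for _ in range(40)]
rng = random.Random(0)
test, rest = take_stratified(range(200), labels, 40, rng)   # 8 per class
val, train = take_stratified(rest, labels, 32, rng)         # 6 or 7 per class
```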

[0040] The single-modal baseline performance for classifying the aging years of tangerine peel using near-infrared spectroscopy, color features, electronic nose, and electronic tongue is shown in Appendix Table 1. The Partial Least Squares Discriminant Analysis (PLS-DA) model based on near-infrared spectroscopy achieved an accuracy of 90.0%, a macro-average F1-score of 0.897, and a Cohen's Kappa coefficient of 0.875 on the test set. This indicates that NIR alone carries strong discriminative information, although notable off-diagonal elements still exist in the confusion matrix. For taste perception, the support vector machine with a radial basis function kernel (SVM-RBF) based on electronic tongue features achieved an accuracy of 95.0% and a macro-average F1-score of 0.949. The confusion matrices of the hierarchical multimodal fusion network and the classic baseline models for classifying the aging years of tangerine peel are shown in Figure 2, where (a) is the confusion matrix of partial least squares discriminant analysis, (b) is the confusion matrix of the radial basis function support vector machine, (c) is the confusion matrix of the K-nearest neighbor algorithm, (d) is the confusion matrix of random forest, and (e) is the confusion matrix of the hierarchical multimodal fusion network.

[0041] Table 1. Performance of Single-Modal Baseline Classifier

[0042] The early fusion baseline performance is shown in Appendix Table 2. When the concatenated features of the four modalities are input into the K-Nearest Neighbors (KNN) model, the test accuracy reaches 92.5% and the macro-average F1 score is 0.924. The PLS-DA model with concatenated features achieves an accuracy of 95.0% and a macro-average F1 score of 0.950, comparable to the strongest single-modal Support Vector Machine (SVM) and Random Forest (RF) models. The RF model with concatenated features also achieves an accuracy of 92.5%. These results show that simply concatenating the features of the four modalities and inputting them into a classic machine learning model can improve performance over single-modal models.

[0043] The performance of the hierarchical multimodal fusion network is shown in Appendix Table 2. On the independent test set, the hierarchical multimodal fusion network achieved an accuracy of up to 98.0%, a macro-average F1 score of 0.975, and a Cohen's Kappa coefficient of 0.969, outperforming all baseline models on every reported metric.

[0044] Table 2. Performance Comparison of Multimodal Fusion Classifiers for Classifying the Aging Years of Dried Tangerine Peel

[0045] It will be apparent to those skilled in the art that the present invention is not limited to the details of the exemplary embodiments described above, and that the invention can be implemented in other specific forms without departing from its spirit or essential characteristics. Therefore, the embodiments should be considered in all respects as exemplary and non-limiting, and the scope of the invention is defined by the appended claims rather than the foregoing description. Thus, all variations falling within the meaning and scope of equivalents of the claims are intended to be included within the present invention. No reference numerals in the claims should be construed as limiting the scope of the claims.

[0046] Furthermore, it should be understood that although this specification describes embodiments, not every embodiment contains only one independent technical solution. This narrative style is merely for clarity. Those skilled in the art should consider the specification as a whole, and the technical solutions in each embodiment can also be appropriately combined to form other embodiments that can be understood by those skilled in the art.

Claims

1. A method for identifying the age of dried tangerine peel based on hierarchical multimodal fusion, characterized in that the method includes the following steps:
Step 1: Collect multimodal data for each tangerine peel sample, including near-infrared spectrum, color, electronic nose and electronic tongue features, and standardize all features, with the near-infrared spectrum as the main modality and color, electronic nose and electronic tongue as auxiliary modalities.
Step 2: Construct a hierarchical multimodal fusion network, extract the backbone features of the main modality, perform dual-path projection on the main-modality features to generate near-infrared spectral fusion features and near-infrared spectral refinement features, and encode and map the auxiliary modalities.
Step 3: Perform the first stage of multimodal fusion by splicing the near-infrared spectral fusion features with the three auxiliary-modality features. The spliced features are passed through a fusion block to generate a preliminary fused representation.
Step 4: Perform the second stage of main-modality information refinement. The preliminary fused representation and the near-infrared spectral refinement features are combined by residual feature fusion to obtain the final fused feature representation, which is then used for final classification.
Step 5: Evaluate the model performance.

2. The method for identifying the age of dried tangerine peel based on hierarchical multimodal fusion according to claim 1, characterized in that, in step 1, the multimodal data includes: near-infrared spectral data, with the spectra stitched together to form a 250-dimensional vector $\mathbf{x}_{NIR} \in \mathbb{R}^{250}$; color feature data, a 6-dimensional vector $\mathbf{x}_{C} \in \mathbb{R}^{6}$; electronic nose feature data, a 10-dimensional vector $\mathbf{x}_{EN} \in \mathbb{R}^{10}$; and electronic tongue feature data, an 8-dimensional vector $\mathbf{x}_{ET} \in \mathbb{R}^{8}$.

3. The method for identifying the age of dried tangerine peel based on hierarchical multimodal fusion according to claim 1, characterized in that, in step 2, the backbone feature extraction adopts a one-dimensional ResNet backbone network, the encoding mapping adopts a fully connected network, and the encoding mapping is completed by a fully connected encoder and mapped to the shared latent space. The dual-path projection generates the fusion features for the first stage of multimodal fusion and the refinement features for the second stage of main-modality information refinement through two independent fully connected layers in the fully connected network.

4. The method for identifying the age of dried tangerine peel based on hierarchical multimodal fusion according to claim 1, characterized in that, In step 3, the splicing is completed in the latent space, and the fusion block is a fusion layer with batch normalization, activation function and Dropout, generating a preliminary fusion representation.

5. The method for identifying the age of dried tangerine peel based on hierarchical multimodal fusion according to claim 1, characterized in that, in step 4, residual feature fusion adds the preliminary fused representation to the near-infrared spectral refinement features, retaining the age information dominated by the main modality, to obtain the final fused feature representation.

6. The method for identifying the age of dried tangerine peel based on hierarchical multimodal fusion according to claim 1, characterized in that, In step 4, the final classification uses a shallow multilayer perceptron classifier to output the category probability of the tangerine peel's year.

7. The method for identifying the age of dried tangerine peel based on hierarchical multimodal fusion according to claim 1, characterized in that, in step 5, the metrics used to evaluate model performance include overall accuracy, the precision, recall and F1-score of each class k, the macro-average metrics, and Cohen's Kappa coefficient.