A machine tasting method for baijiu (Chinese liquor) based on multimodal fusion and deep learning

By integrating olfactory, gustatory, and visual data through multimodal fusion and deep learning, multimodal digital feature vectors are generated. These vectors are then used to train a multi-task attention fusion network, which addresses the insufficient accuracy and consistency of baijiu quality evaluation and yields an objective and stable evaluation.

CN122241568A · Pending · Publication Date: 2026-06-19 · CHINA AGRI UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
CHINA AGRI UNIV
Filing Date
2026-03-12
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing technologies struggle to effectively integrate olfactory, gustatory, and visual multimodal data, resulting in insufficient accuracy and consistency in the evaluation of baijiu quality. Furthermore, general machine learning models are unable to optimize multiple evaluation objectives.

Method used

By employing a multimodal fusion and deep learning approach, this method integrates olfactory, gustatory, and visual data, utilizes a multi-task attention fusion network to establish feature mapping relationships, generates multimodal digital feature vectors, and performs multi-task training to output evaluation results.

Benefits of technology

It achieves objective and stable evaluation of liquor quality, reduces the subjectivity and volatility of manual evaluation, and improves the consistency and repeatability of evaluation results.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention discloses a machine tasting method for baijiu (Chinese liquor) based on multimodal fusion and deep learning, relating to the field of intelligent food quality detection and sensory evaluation. The method includes the following steps: acquiring the corresponding olfactory time-series data, gustatory current time-series data, and visual image data based on multiple batches of baijiu samples; performing preprocessing and extracting olfactory, gustatory, and visual feature vectors; performing dimensionality reduction and weighted fusion to generate multimodal digital feature vectors; training a multi-task attention fusion network using the corresponding quality grade and sensory scores as training labels; and inputting the multimodal digital feature vector of the baijiu sample to be tested into the trained multi-task attention fusion network to obtain the corresponding tasting results. The method shifts baijiu tasting from dependence on subjective experience to an objective, data-driven process, improving the stability and repeatability of the evaluation results.

Description

Technical Field

[0001] This invention relates to the field of intelligent food quality detection and sensory evaluation, and more specifically to a machine tasting method for baijiu (Chinese liquor) based on multimodal fusion and deep learning.

Background Technology

[0002] The quality evaluation of baijiu (Chinese liquor) relies heavily on the sensory experience of professional tasters. It involves a comprehensive judgment of the liquor's color, aroma, taste, and style. This process is the core link in the quality control, grading, and market pricing of baijiu products.

[0003] To reduce the subjectivity of human evaluation, existing technologies have gradually introduced digital detection methods such as electronic noses, electronic tongues, and machine vision. For example, volatile components are detected by electronic nose sensor arrays, or taste characteristics are analyzed through the electrochemical response of electronic tongues. However, most of these technologies analyze a single dimension (smell, taste, or vision) independently. Such isolated detection reflects only one aspect of baijiu quality and struggles to capture the complex, comprehensive character of baijiu that arises from the synergy of multiple sensory attributes. This leads to one-sided conclusions and cannot replace a complete human sensory evaluation.

[0004] Furthermore, some studies have attempted to combine data from multiple digital sensory detection methods. However, because data from different modalities differ fundamentally in source, dimension, temporal characteristics, and physical meaning, existing methods typically employ simple data concatenation or shallow statistical fusion. This not only fails to uncover the deep correlations and complementary information between modalities, but also easily introduces noise and redundancy, causing information overload or feature distortion. Consequently, the final digital representation does not correspond accurately and stably to the actual quality grade and sensory experience of baijiu.

[0005] Furthermore, establishing the mapping between features and the final evaluation conclusion remains a challenge. Existing methods often use general machine learning models. However, baijiu evaluation is essentially a complex task involving both classification and regression. General model architectures have difficulty optimizing multiple objectives simultaneously and lack the ability to specifically model the internal interactions among multimodal features, resulting in insufficient accuracy and interpretability of model predictions.

[0006] Therefore, how to design a machine tasting method for baijiu based on multimodal fusion and deep learning that synergistically uses multi-dimensional sensory information and achieves objective and stable intelligent baijiu tasting through effective fusion and dedicated modeling is a problem that urgently needs to be solved by those skilled in the art.

Summary of the Invention

[0007] In view of this, the present invention provides a method for machine evaluation of baijiu based on multimodal fusion and deep learning. It aims to solve the problems of strong subjectivity, poor consistency and difficulty in standardization of traditional human evaluation by integrating multimodal sensor data such as smell, taste and vision, and using deep learning models to establish a mapping relationship between digital features and human evaluation results. This provides a reliable solution for quality control, grading and quality evaluation in the baijiu production process.

[0008] To achieve the above objectives, the present invention adopts the following technical solution:

[0009] A machine tasting method for baijiu (Chinese liquor) based on multimodal fusion and deep learning includes the following steps:
S1. Based on multiple batches of baijiu samples, obtain the corresponding olfactory time-series data, gustatory current time-series data, and visual image data;
S2. Preprocess the olfactory time-series data, gustatory current time-series data, and visual image data, and extract olfactory, gustatory, and visual feature vectors from the preprocessed data;
S3. Perform dimensionality reduction on the olfactory, gustatory, and visual feature vectors, and perform weighted fusion on the dimensionality-reduced features to generate a multimodal digital feature vector;
S4. Using the corresponding quality grade and sensory scores as training labels, train a multi-task attention fusion network with the multimodal digital feature vectors;
S5. Input the multimodal digital feature vector of the baijiu sample to be tested into the trained multi-task attention fusion network to obtain the corresponding evaluation results.

[0010] Preferably, S1 includes: using a metal oxide semiconductor sensor array to collect the volatile gases of a baijiu sample to obtain multi-channel olfactory time-series data; applying a multi-step pulse voltage excitation to the baijiu sample with a voltammetric electronic tongue and collecting the current response of the working electrode to obtain gustatory current time-series data; and, under standardized optical conditions, using an industrial camera to photograph the baijiu sample placed in a transparent container to obtain visual image data.

[0011] Preferably, in step S2, the preprocessing includes: denoising the olfactory time-series data with the Savitzky-Golay smoothing algorithm; denoising the gustatory current time-series data with a wavelet thresholding algorithm; and denoising the visual image data with a median filtering algorithm.

[0012] Preferably, in step S2, extracting the olfactory feature vector includes: for each independent channel of the olfactory time-series data, extracting time-domain features including the mean, maximum, variance, integral, maximum difference, and mean absolute derivative; and concatenating the time-domain features of all channels in channel order to form the olfactory feature vector. Extracting the gustatory feature vector includes: extracting the steady-state response current, cumulative charge, and redox symmetry ratio features from the current response of each pulse excitation cycle; and concatenating the features of all pulse excitation cycles in cycle order to form the gustatory feature vector. Extracting the visual feature vector includes: extracting the mean yellowness, color uniformity, relative transmittance, image information entropy, and suspended matter ratio features from the image; and concatenating all features in a preset order to form the visual feature vector.

[0013] Preferably, S3 includes: using principal component analysis (PCA) to reduce the dimensionality of the olfactory, gustatory, and visual feature vectors, retaining the principal components whose cumulative variance contribution rate is not lower than a preset threshold; using the entropy weight method to calculate the weight coefficient of each dimensionality-reduced modal feature and weighting the features accordingly; and concatenating the weighted feature vectors of each modality to generate the multimodal digital feature vector.

[0014] Preferably, the entropy weight method is used to calculate the weight coefficient of each dimensionality-reduced modal feature, and the weight coefficient $w_j$ of the j-th feature is expressed as:

$$w_j = \frac{d_j}{\sum_{k=1}^{p} d_k}$$

[0015] where $d_j$ and $d_k$ are the information utility values of the j-th and k-th features, respectively, with $d_j = 1 - e_j$; $e_j = -\frac{1}{\ln m}\sum_{i=1}^{m} p_{ij}\ln p_{ij}$ is the information entropy; $p_{ij}$ is the weight of the i-th sample on the j-th feature; $m$ is the number of samples; and $p$ is the number of features.

[0016] Preferably, in step S4, the multi-task attention fusion network includes: a cross-attention interaction module, used to calculate the interaction features between any two modalities through a cross-attention mechanism; a multi-task adaptive aggregation module, used to aggregate the interaction features and generate deep fusion features for the classification and regression tasks; and a task-specific output module, used to process the deep fusion features for the classification and regression tasks and output quality grade probabilities and sensory scores.

[0017] Preferably, the data processing procedure of the cross-attention interaction module includes: for modalities a and b, the feature vector $F_a$ is mapped to a query vector $Q_a$ through a learnable parameter matrix $W_Q$, and the feature vector $F_b$ is mapped to a key vector $K_b$ and a value vector $V_b$ through learnable parameter matrices $W_K$ and $W_V$, respectively; the attention weights $A_{ab}$ are calculated from the query vector $Q_a$ and the key vector $K_b$; the attention weights $A_{ab}$ are used to perform weighted fusion on the value matrix $V_b$ to obtain the initial cross-modal interaction feature $H_{ab}$; the initial cross-modal interaction feature $H_{ab}$ is processed by a Dropout operation and combined with the feature vector $F_a$ through a residual connection, and the result is layer-normalized to obtain the interaction feature $\tilde{H}_{ab}$.

[0018] Preferably, the multi-task adaptive aggregation module includes: a classification aggregation unit, which passes each group of interaction features through a fully connected layer for nonlinear transformation and then through a max pooling layer, after which all max pooling results are concatenated and fused through a fully connected layer to generate the classification deep fusion feature; and a regression aggregation unit, which passes each group of interaction features through a fully connected layer for nonlinear transformation and then through an average pooling layer, after which all average pooling results are concatenated and fused through a fully connected layer to generate the regression deep fusion feature.
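The two aggregation units can be illustrated with a minimal NumPy sketch. It is not the patented implementation: the ReLU nonlinearity, the layer widths, the treatment of each interaction feature as a (tokens, dimensions) matrix pooled over the token axis, and the number of modality pairs are all assumptions made for illustration.

```python
import numpy as np

def dense(x, w, b):
    """Fully connected layer followed by ReLU (the choice of ReLU is an assumption)."""
    return np.maximum(x @ w + b, 0.0)

def aggregate(interactions, params, pool):
    """Transform each interaction feature, pool it, concatenate, then fuse.

    interactions: list of arrays shaped (tokens, d_in), one per modality pair
    params: {"pair": [(w, b), ...], "fuse": (w, b)}
    pool: np.max for the classification unit, np.mean for the regression unit
    """
    pooled = [pool(dense(h, w, b), axis=0)
              for h, (w, b) in zip(interactions, params["pair"])]
    return dense(np.concatenate(pooled), *params["fuse"])

def init_params(rng, n_pairs, d_in, d_hidden, d_out):
    """Random weights, standing in for trained parameters."""
    pair = [(rng.normal(0, 0.1, (d_in, d_hidden)), np.zeros(d_hidden))
            for _ in range(n_pairs)]
    fuse = (rng.normal(0, 0.1, (n_pairs * d_hidden, d_out)), np.zeros(d_out))
    return {"pair": pair, "fuse": fuse}

rng = np.random.default_rng(0)
# Hypothetical input: six ordered modality pairs, each a (4, 8) interaction feature.
interactions = [rng.normal(size=(4, 8)) for _ in range(6)]
z_cls = aggregate(interactions, init_params(rng, 6, 8, 16, 32), np.max)   # max pooling branch
z_reg = aggregate(interactions, init_params(rng, 6, 8, 16, 32), np.mean)  # average pooling branch
```

The only difference between the two branches is the pooling function, which mirrors the text: max pooling for the classification feature and average pooling for the regression feature.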

[0019] Preferably, the task-specific output module includes: a classification head, consisting of two fully connected layers and a Gaussian error linear unit (GELU) activation function, which processes the classification deep fusion feature and outputs the probability distribution of the sample over the premium, first-grade, and second-grade categories; and a regression head, consisting of four parallel scoring units, each comprising two fully connected layers and corresponding to a different sensory score, which output the color and appearance score, the aroma score, the taste and mouthfeel score, and the style score.
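A minimal NumPy sketch of the two output heads follows, under stated assumptions: the softmax on the classification logits, the GELU placed between the two layers of each scoring unit, the hidden width, and the random weights are illustrative choices that the text does not specify.

```python
import numpy as np

def gelu(x):
    """Gaussian error linear unit (tanh approximation)."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def two_layer(x, w1, b1, w2, b2):
    """Two fully connected layers with a GELU in between."""
    return gelu(x @ w1 + b1) @ w2 + b2

def make_head(rng, d_in, d_hidden, d_out):
    """Random weights, standing in for trained parameters."""
    return (rng.normal(0, 0.1, (d_in, d_hidden)), np.zeros(d_hidden),
            rng.normal(0, 0.1, (d_hidden, d_out)), np.zeros(d_out))

rng = np.random.default_rng(0)
z_cls = rng.normal(size=32)  # hypothetical classification deep fusion feature
z_reg = rng.normal(size=32)  # hypothetical regression deep fusion feature

# Classification head: probabilities over the three quality grades.
probs = softmax(two_layer(z_cls, *make_head(rng, 32, 16, 3)))

# Regression head: four parallel scoring units, one per sensory score.
score_names = ["color_appearance", "aroma", "taste_mouthfeel", "style"]
scores = {name: float(two_layer(z_reg, *make_head(rng, 32, 16, 1))[0])
          for name in score_names}
```

Keeping the four scoring units parallel lets each score have its own parameters while sharing the same regression feature as input.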

[0020] As can be seen from the above technical solution, compared with the prior art, the technical solution of the present invention has the following beneficial effects: 1. This method simultaneously collects digital data from multiple modalities, including smell, taste, and vision, and employs targeted preprocessing and feature extraction strategies to construct structured feature vectors. Furthermore, through dimensionality reduction and weighted fusion based on entropy weighting, a unified multimodal digital feature vector is generated. This overcomes the limitations of single-sensory detection and integrates isolated information about smell, taste, and appearance into a digital fingerprint that can comprehensively characterize the overall quality of baijiu.

[0021] 2. The multi-task attention fusion network adopted explicitly models the complex relationships between features of different sensory modalities through the cross-attention interaction module, and generates differentiated deep fusion features for classification and regression tasks through the multi-task adaptive aggregation module. It can accurately output level judgments and multi-dimensional scores, and realize a high-efficiency mapping from multi-modal data to evaluation conclusions.

[0022] 3. The entire method, from sample preparation and standardization of the environment and parameters for data collection to data processing and model analysis, has established objective and unified operating procedures and calculation criteria. The final evaluation results are derived from the model calculation of standardized digital features, thereby effectively eliminating the influence of individual differences and state fluctuations of the evaluators, and demonstrating a high degree of consistency in the evaluation of samples tested at different times and in different batches.

Description of the Drawings

[0023] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on the provided drawings without creative effort.

[0024] Figure 1 is a flowchart of a machine tasting method for baijiu (Chinese liquor) based on multimodal fusion and deep learning provided in an embodiment of the present invention;
Figure 2 shows the multi-channel electrical signal response curves collected by the olfactory detection device provided in an embodiment of the present invention;
Figure 3 is a schematic diagram of the excitation signal applied to the working electrode in the taste detection device provided in an embodiment of the present invention;
Figure 4 is a schematic diagram of the current response signal collected by the taste detection device provided in an embodiment of the present invention;
Figure 5 is a schematic diagram of the multi-task attention fusion network structure provided in an embodiment of the present invention.

Detailed Implementation

[0025] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0026] As shown in Figure 1, this embodiment provides a machine tasting method for baijiu (Chinese liquor) based on multimodal fusion and deep learning, including the following steps:
S1. Based on multiple batches of baijiu samples, obtain the corresponding olfactory time-series data, gustatory current time-series data, and visual image data;
S2. Preprocess the olfactory time-series data, gustatory current time-series data, and visual image data, and extract olfactory, gustatory, and visual feature vectors from the preprocessed data;
S3. Perform dimensionality reduction on the olfactory, gustatory, and visual feature vectors, and perform weighted fusion on the dimensionality-reduced features to generate a multimodal digital feature vector;
S4. Using the corresponding quality grade and sensory scores as training labels, train a multi-task attention fusion network with the multimodal digital feature vectors;
S5. Input the multimodal digital feature vector of the baijiu sample to be tested into the trained multi-task attention fusion network to obtain the corresponding evaluation results.

[0027] This method collects standardized olfactory, gustatory, and visual data of baijiu (Chinese liquor), constructs multimodal digital feature vectors, and uses a dedicated multi-task attention fusion network for training and prediction. This enables objective, quantitative, and automated evaluation of baijiu quality. It effectively integrates multidimensional sensory information, reduces the subjectivity and volatility of traditional manual evaluation, and improves the consistency and repeatability of evaluation results.

[0028] Each step of the above method and its related features are further explained below.

In step S1 of this embodiment, the corresponding olfactory time-series data, gustatory current time-series data, and visual image data are obtained based on multiple batches of baijiu samples, as follows. Several representative batches of baijiu samples are selected as test objects; for example, several samples each of premium, first-grade, and second-grade sauce-flavored baijiu. Before testing, the samples are kept in a constant environment at a temperature of 16°C to 26°C and a humidity of 40% to 70% for at least 12 hours, to stabilize their physicochemical state and reduce the impact of environmental fluctuations on sensory characteristics.

Olfactory data acquisition uses a 30-channel metal-oxide-semiconductor sensor array to detect volatile gases after headspace equilibration, in which a 0.5 ml liquor sample is sealed in a 20 ml headspace vial for 12 hours. Standardized cleaning, baseline, and sampling procedures are established: 600 s of cleaning, followed by 300 s of sampling at a flow rate of 800 ml/min at 1 s intervals, while the electrical signal response of each channel is recorded in real time. As shown in Figure 2, olfactory acquisition yields a cluster of multi-channel time-series response curves. The 30 curves, each with a different shape, together constitute an electronic-nose fingerprint that characterizes the aroma profile of the sample. Features such as response peaks, curve slopes, and areas carry rich information about the volatile components.

Gustatory data are acquired with a voltammetric electronic tongue using a customized multi-step, large-amplitude pulse voltammetry excitation signal comprising 10 pulses, scanned in both the positive and negative directions, with amplitudes ranging from ±200 mV to ±1000 mV. The current response of the working electrode is recorded in real time to form the gustatory current time-series data. Figure 3 shows the multi-step, large-amplitude pulse voltammetry excitation signal applied to the working electrode, including the positive and negative step-pulse sequences and the zero-potential relaxation periods. Figure 4 shows the current response signal collected under this excitation, which directly reflects the electrochemical behavior of the liquor sample at different potentials. The current amplitude, the shape of the pulse response, and the relaxation characteristics are the basis for the subsequent extraction of gustatory features.

Visual data acquisition is conducted in a closed, light-shielded box. A high-resolution industrial camera photographs liquor samples placed in a standard optical container under a strictly controlled D65 standard light source, with locked exposure parameters (exposure time of 20 to 50 ms, gain of 0 dB), to obtain high-quality, interference-free two-dimensional digital images.

[0029] This step achieves the simultaneous and standardized acquisition of olfactory, gustatory, and visual information. Equilibrating the samples to a controlled temperature and humidity beforehand eliminates signal fluctuations caused by non-quality factors. Furthermore, a differentiated and optimized excitation and capture mechanism is designed for each modality. Olfactory acquisition, through headspace equilibration and constant-flow sample injection, simulates and amplifies the human process of smelling the aroma. Gustatory acquisition, through a carefully designed pulse voltage sequence, excites a broad range of electrochemically active substances in the liquor and yields far more information than single-voltage detection. Visual acquisition, by constructing a controlled optical environment, transforms subjective judgments of color and clarity into quantifiable pixel matrices. This rigorous data acquisition standard provides a reliable foundation for subsequent data processing and model building, and helps avoid model bias or failure caused by data quality issues.

[0030] In step S2 of this embodiment, the olfactory time-series data, gustatory current time-series data, and visual image data are preprocessed, and the olfactory, gustatory, and visual feature vectors are extracted from the preprocessed data. Preprocessing applies targeted noise reduction according to the characteristics of each modality. For olfactory time-series data, which change relatively smoothly but are susceptible to high-frequency noise, the Savitzky-Golay smoothing algorithm is employed; it filters out noise while preserving key morphological information, such as the peak shape and width of the response curve, as far as possible. For the non-stationary gustatory current signals containing step pulses, a wavelet thresholding denoising algorithm is used; its multi-resolution analysis effectively suppresses random electromagnetic interference while preserving transient details such as the steep edges of the pulses. For visual images that may contain salt-and-pepper noise, a median filtering algorithm is used; it removes isolated noise points while maintaining the contours of the liquor's edges and of tiny suspended particles.
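The three denoising steps can be sketched in plain NumPy as follows. This is an illustrative sketch only: the window length, polynomial order, single-level Haar wavelet, threshold value, and 3x3 median window are assumptions, since the text does not give these parameters, and a production pipeline would more likely rely on established library routines.

```python
import numpy as np

def savgol_smooth(x, window=11, order=3):
    """Savitzky-Golay smoothing: local least-squares polynomial fit, evaluated at the center."""
    half = window // 2
    offsets = np.arange(-half, half + 1, dtype=float)
    design = np.vander(offsets, order + 1, increasing=True)
    coeffs = np.linalg.pinv(design)[0]          # weights giving the fitted value at offset 0
    padded = np.pad(x, half, mode="edge")
    return np.convolve(padded, coeffs[::-1], mode="valid")

def haar_soft_denoise(x, thresh):
    """One-level Haar wavelet transform with soft thresholding of the detail coefficients."""
    x = np.asarray(x, dtype=float)[: len(x) // 2 * 2]   # use an even number of samples
    approx = (x[0::2] + x[1::2]) / np.sqrt(2)
    detail = (x[0::2] - x[1::2]) / np.sqrt(2)
    detail = np.sign(detail) * np.maximum(np.abs(detail) - thresh, 0.0)
    out = np.empty_like(x)
    out[0::2] = (approx + detail) / np.sqrt(2)
    out[1::2] = (approx - detail) / np.sqrt(2)
    return out

def median_filter3(img):
    """3x3 median filter with edge replication, suited to salt-and-pepper noise."""
    padded = np.pad(img, 1, mode="edge")
    h, w = img.shape
    windows = np.stack([padded[i:i + h, j:j + w] for i in range(3) for j in range(3)])
    return np.median(windows, axis=0)

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 300)
nose = np.exp(-((t - 0.4) / 0.15) ** 2) + 0.02 * rng.normal(size=t.size)  # synthetic channel
nose_smooth = savgol_smooth(nose)
```

The Savitzky-Golay filter reproduces any polynomial up to the chosen order exactly away from the edges, which is why it preserves peak shape better than a plain moving average.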

[0031] Specifically, extracting the olfactory feature vector includes: for each independent channel of the olfactory time-series data, extracting time-domain features including the mean, maximum, variance, integral, maximum difference, and mean absolute derivative. Let the signal of a single channel be $x(t)$, the response signal of the sensor at time $t$, and let $T$ be the acquisition duration. The calculation formula for each time-domain feature is given in Table 1.

Table 1. Calculation formulas for the time-domain features (table body not present in this text)

[0032] Concatenating these features from all channels in order forms a high-dimensional olfactory feature vector, which comprehensively characterizes the intensity, stability, and dynamic changes of the aroma.
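As a hedged illustration of the per-channel feature extraction, the sketch below computes six time-domain features for each channel and concatenates them in channel order. The exact formulas of Table 1 are not available here, so the definitions used (population variance, trapezoidal integral, maximum difference taken as maximum minus minimum, derivative as a finite difference) are assumptions.

```python
import numpy as np

def channel_features(x, dt=1.0):
    """Six time-domain features of one sensor channel sampled every dt seconds."""
    dx = np.diff(x) / dt
    return np.array([
        x.mean(),                               # mean
        x.max(),                                # maximum
        x.var(),                                # variance (population form, an assumption)
        np.sum((x[1:] + x[:-1]) * dt / 2.0),    # integral by the trapezoidal rule
        x.max() - x.min(),                      # maximum difference (assumed: max minus min)
        np.abs(dx).mean(),                      # mean absolute derivative
    ])

def olfactory_vector(signals, dt=1.0):
    """signals: (n_channels, n_samples). Returns features concatenated in channel order."""
    return np.concatenate([channel_features(ch, dt) for ch in signals])

# Hypothetical data: 30 channels, each a short ramp sampled at 1 s intervals.
signals = np.tile(np.arange(5.0), (30, 1))
vec = olfactory_vector(signals)
```

With 30 channels and six features per channel, the resulting vector has 180 entries before any dimensionality reduction.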

[0033] Furthermore, extracting the gustatory feature vector includes extracting the steady-state response current, cumulative charge, and redox symmetry ratio features from the current response of each pulse excitation cycle. Specifically, let $k$ be the pulse number ($k = 1, 2, \ldots, 10$), let $t$ be the time relative to the start of each pulse, and let $I_k(t)$ denote the instantaneous current within the $k$-th pulse cycle.

The steady-state response current characterizes the Faradaic current intensity when the electrochemical reaction on the electrode surface reaches a quasi-equilibrium state under a specific potential excitation, and reflects the equilibrium concentration of the analyte. For the $k$-th pulse, the current data in a time window $\Delta t$ immediately before the end of the pulse are taken, and their arithmetic mean is denoted $\bar{I}_{ss,k}$:

$$\bar{I}_{ss,k} = \frac{1}{N}\sum_{n=1}^{N} I_k(t_n)$$

[0034] where $I_k(t_n)$ is the instantaneous current of the $k$-th pulse at time $t_n$, and $N$ is the number of sampling points within the time window $\Delta t$.

The cumulative charge feature quantifies the total charge transferred in the electrode reaction within a single pulse duration as the area under the current curve. Compared with a single-point current, the integral feature effectively smooths high-frequency random noise and improves the signal-to-noise ratio. For the $k$-th pulse, the cumulative charge $Q_k$ is expressed as:

$$Q_k = \int_{0}^{T_p} I_k(t)\,dt$$

[0035] where $T_p$ is the duration of the pulse.

Since the excitation signal contains two symmetrical step-scan processes, one positive and one negative, the redox symmetry ratio feature characterizes the difference in redox reversibility of the sample at potentials of the same amplitude but opposite polarity. Pulse pairs with the same absolute amplitude are selected, and the ratio $R_{sym}$ of their absolute steady-state currents is calculated:

$$R_{sym} = \frac{\left|\bar{I}_{ss}^{+}\right|}{\left|\bar{I}_{ss}^{-}\right|}$$

[0036] where $\bar{I}_{ss}^{+}$ and $\bar{I}_{ss}^{-}$ are the steady-state current values corresponding to the positive and negative potential pulses, respectively.

Concatenating these features over all pulse cycles forms the gustatory feature vector, which quantifies the flavor profile of the liquor from an electrochemical perspective.
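The three gustatory features can be sketched as below. The length of the averaging window at the end of each pulse is not stated in the text, so `tail_fraction` is a hypothetical parameter, and the direction of the symmetry ratio (positive over negative) is likewise an assumption.

```python
import numpy as np

def pulse_features(current, dt, tail_fraction=0.1):
    """Steady-state current and cumulative charge for one pulse.

    current: sampled current of a single pulse; dt: sampling interval in seconds.
    tail_fraction: hypothetical share of the pulse, at its end, used for averaging.
    """
    n_tail = max(1, int(round(len(current) * tail_fraction)))
    i_ss = current[-n_tail:].mean()                              # steady-state response current
    charge = np.sum((current[1:] + current[:-1]) * dt / 2.0)     # area under the current curve
    return i_ss, charge

def symmetry_ratio(i_ss_pos, i_ss_neg):
    """Ratio of absolute steady-state currents of a pulse pair with equal amplitude."""
    return abs(i_ss_pos) / abs(i_ss_neg)

# Hypothetical pulse pair: decaying currents that settle at +2 and -4 (arbitrary units).
t = np.linspace(0.0, 1.0, 101)
pos = 2.0 + 3.0 * np.exp(-t / 0.05)
neg = -4.0 - 3.0 * np.exp(-t / 0.05)
i_pos, q_pos = pulse_features(pos, dt=t[1] - t[0])
i_neg, q_neg = pulse_features(neg, dt=t[1] - t[0])
ratio = symmetry_ratio(i_pos, i_neg)
```

Averaging over a tail window rather than taking the last sample makes the steady-state estimate less sensitive to a single noisy reading.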

[0037] Furthermore, extracting the visual feature vector includes extracting the mean yellowness, color uniformity, relative transmittance, image information entropy, and suspended matter ratio features from the image.

To objectively evaluate the typical "slightly yellow" and "transparent" characteristics of baijiu, the ROI image is converted non-linearly from the RGB color space to the CIE L*a*b* uniform color space, where L* represents lightness, a* the red-green component, and b* the yellow-blue component. Since the degree of aging of baijiu is usually positively correlated with the depth of its yellow color, the focus is on the mean and standard deviation of the b* channel. Let $P(x,y)$ be the pixel at coordinates $(x,y)$ within the ROI and $N$ the total number of pixels within the ROI. The mean yellowness $\mu_b$ is expressed as:

$$\mu_b = \frac{1}{N}\sum_{(x,y)\in\mathrm{ROI}} b^*(x,y)$$

[0038] where $b^*(x,y)$ is the b* component value of the pixel at $(x,y)$ in the CIELAB color space. The color uniformity $\sigma_b$ is expressed as:

$$\sigma_b = \sqrt{\frac{1}{N}\sum_{(x,y)\in\mathrm{ROI}} \left(b^*(x,y) - \mu_b\right)^2}$$

[0039] The higher $\mu_b$ is, the more yellow the liquor; the smaller $\sigma_b$ is, the more evenly its color is distributed.

Relative transmittance quantifies the clarity of the baijiu. The color image is converted to a grayscale image; under backlight transmission imaging, the clearer the liquor, the higher the gray value, and the cloudier the liquor, the greater the light attenuation and the lower the gray value. The average grayscale transmittance $T_{gray}$ is defined as:

$$T_{gray} = \frac{1}{N}\sum_{(x,y)\in\mathrm{ROI}} G(x,y)$$

[0040] where $G(x,y)$ is the gray value of the pixel at $(x,y)$; $T_{gray}$ is linearly and positively correlated with the transmittance index and is used for rapid screening of turbid samples.

To detect minute suspended particles or a very slight loss of luster that are difficult to see with the naked eye, this embodiment introduces the image information entropy feature to characterize the purity and disorder of the liquor. When the liquor is perfectly clear, the image texture is smooth and the information entropy is low; when tiny particles or flocculent sediment are present, the texture complexity increases and the information entropy rises markedly. The calculation formula is:

$$H = -\sum_{i} p_i \log_2 p_i$$

[0041] where $i$ is the gray level and $p_i$ is the probability of gray level $i$ in the ROI. This feature is highly sensitive to the diffuse reflection caused by tiny particles and is a key indicator of microscopic sedimentation in the liquor.

In addition, the suspended matter ratio feature is introduced. For impurities that may be visible, an adaptive threshold segmentation algorithm binarizes the ROI, and dark spots with gray values below the threshold are identified as suspected impurity regions. The ratio of the total pixel area of the suspected impurity regions $A_{imp}$ to the total ROI area $A_{ROI}$ is denoted as the impurity rate $R_{imp}$:

$$R_{imp} = \frac{A_{imp}}{A_{ROI}}$$

[0042] All features are concatenated in a preset order to form the visual feature vector, which objectively corresponds to the color, clarity, and purity of the liquor.
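The five visual features can be sketched as follows, with several simplifications that should be read as assumptions: the b* channel is taken as an already-converted input (the RGB to CIELAB conversion is omitted), the entropy uses a 256-bin histogram with base-2 logarithm, and a fixed `dark_threshold` stands in for the adaptive threshold segmentation described in the text.

```python
import numpy as np

def gray_entropy(gray):
    """Shannon entropy, in bits, of the 8-bit gray-level histogram of the ROI."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    p = hist[hist > 0] / hist.sum()
    return float(-(p * np.log2(p)).sum())

def visual_features(b_star, gray, dark_threshold):
    """Feature vector in a fixed order.

    b_star: b* channel of the ROI (float array), assumed already in CIELAB
    gray: 8-bit grayscale ROI (uint8 array)
    dark_threshold: fixed gray value, a simplification of adaptive thresholding
    """
    return np.array([
        b_star.mean(),                    # mean yellowness
        b_star.std(),                     # color uniformity (standard deviation of b*)
        gray.mean(),                      # mean gray level as a transmittance proxy
        gray_entropy(gray),               # image information entropy
        (gray < dark_threshold).mean(),   # share of suspected impurity pixels
    ])

# Hypothetical ROI: uniform bright liquor with a single dark speck.
gray = np.full((10, 10), 200, dtype=np.uint8)
gray[0, 0] = 10
b_star = np.full((10, 10), 5.0)
features = visual_features(b_star, gray, dark_threshold=50)
```

In this toy ROI the single dark pixel raises the entropy above zero and gives an impurity share of 1%, which shows how both features react to small inclusions.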

[0043] Furthermore, after feature extraction, to eliminate differences in the magnitude and units of the responses from different sensors, the extracted raw olfactory, gustatory, and visual features are standardized using Z-score normalization:

$$z = \frac{x - \mu}{\sigma}$$

[0044] where $x$ is the original feature value, $\mu$ is the mean of the feature across all training samples, $\sigma$ is its standard deviation, and $z$ is the standardized feature value.
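A short sketch of the standardization step is given below. It fits the mean and standard deviation on the training samples only and reuses them for new samples, as the text implies; the guard for constant features is an added assumption.

```python
import numpy as np

def fit_zscore(train):
    """Per-feature mean and standard deviation computed on the training samples."""
    mu = train.mean(axis=0)
    sigma = train.std(axis=0)
    sigma = np.where(sigma == 0, 1.0, sigma)   # assumption: leave constant features at zero
    return mu, sigma

def apply_zscore(x, mu, sigma):
    """Standardize samples with previously fitted statistics."""
    return (x - mu) / sigma

train = np.array([[1.0, 10.0], [3.0, 10.0], [5.0, 10.0]])   # hypothetical raw features
mu, sigma = fit_zscore(train)
train_z = apply_zscore(train, mu, sigma)
test_z = apply_zscore(np.array([[3.0, 10.0]]), mu, sigma)   # a new sample, same statistics
```

Reusing the training statistics for the sample under test keeps the test features on the same scale the network saw during training.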

[0045] Furthermore, this step can dynamically optimize the feature selection combination based on the correlation analysis between features and human evaluation results, eliminate redundant or irrelevant features, and further improve the representation efficiency of the feature set.

[0046] In this embodiment, in S3, the olfactory, gustatory, and visual feature vectors are subjected to dimensionality reduction, and the dimensionality-reduced features are weighted and fused to generate a multimodal digital feature vector. This includes the following. Principal component analysis (PCA) is used to reduce the dimensionality of the olfactory, gustatory, and visual feature vectors, retaining the principal components whose cumulative variance contribution rate is not lower than a preset threshold. Specifically, PCA transforms the original correlated variables into a set of linearly uncorrelated principal components through an orthogonal transformation and sorts them by variance contribution rate. With the cumulative variance contribution rate threshold set to 95%, only the top k principal components are retained, which preserves most of the original information while significantly reducing the feature dimension and improving the efficiency and stability of subsequent model processing. The entropy weight method is then used to calculate the weight coefficient of each dimensionality-reduced modal feature, and the features are weighted accordingly. The weight coefficient $w_j$ of the j-th feature is expressed as:

$$w_j = \frac{d_j}{\sum_{k=1}^{p} d_k}, \qquad d_j = 1 - e_j, \qquad e_j = -\frac{1}{\ln m}\sum_{i=1}^{m} p_{ij}\ln p_{ij}$$

[0047] where $d_j$ and $d_k$ are the information utility values of the j-th and k-th features, respectively; $e_j$ is the information entropy of the j-th feature; $p_{ij}$ is the proportion of the i-th sample on the j-th feature; m is the number of samples; and p is the number of features. The weights are calculated from the distribution characteristics of the data itself, which avoids subjective bias and ensures the objectivity of feature fusion. For example, if a certain visual feature differs significantly between liquor samples of different grades, its weight automatically increases, giving it a more important position in the fused features.
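The entropy weight calculation can be sketched in Python as follows. The function name `entropy_weights` is hypothetical. The min-max rescaling of each feature before forming proportions is an assumption added here so that the proportions are non-negative (standardized or PCA-projected values can be negative); the patent text does not state how the proportions are obtained. At least two samples are required, since the entropy is normalized by the logarithm of the sample count.

```python
import math

def entropy_weights(rows):
    """rows: m samples x p features. Returns p weights that sum to 1."""
    m, p = len(rows), len(rows[0])
    utilities = []
    for j in range(p):
        col = [row[j] for row in rows]
        lo, hi = min(col), max(col)
        # Assumption: min-max scale to [0, 1] so proportions are non-negative.
        scaled = [(x - lo) / (hi - lo) if hi > lo else 1.0 for x in col]
        total = sum(scaled)
        props = [s / total for s in scaled]
        # Entropy of feature j, normalized by ln(m); 0 * ln(0) is treated as 0.
        entropy = -sum(q * math.log(q) for q in props if q > 0) / math.log(m)
        # Information utility; clamp tiny negative rounding errors to 0.
        utilities.append(max(0.0, 1.0 - entropy))
    total_utility = sum(utilities)
    if total_utility == 0:
        return [1.0 / p] * p  # all features equally uninformative
    return [d / total_utility for d in utilities]

# Toy data: feature 0 varies across samples, feature 1 is constant.
weights = entropy_weights([[1.0, 5.0], [2.0, 5.0], [3.0, 5.0]])
```

In this toy case almost all of the weight goes to the varying feature, which matches the stated behavior that features differing more between samples receive larger weights.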

[0048] The weighted feature vectors of each modality are concatenated to generate a multimodal digital feature vector, which simultaneously encodes the aroma chemical fingerprint, taste electrochemical profile, and physical appearance properties of the liquor sample. It forms a digital identity card that characterizes the overall quality of the liquor in a comprehensive and complementary way and provides high-quality input for the deep learning model.
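The component selection and the weighted concatenation can be illustrated with a short Python sketch. It assumes the per-component variances have already been obtained from a PCA fit, and the names `components_to_keep` and `fuse` are hypothetical; only the 95% threshold comes from the text.

```python
def components_to_keep(explained_variances, threshold=0.95):
    """Smallest k whose top-k components reach the cumulative variance threshold."""
    total = sum(explained_variances)
    running = 0.0
    ordered = sorted(explained_variances, reverse=True)
    for k, variance in enumerate(ordered, start=1):
        running += variance
        if running / total >= threshold:
            return k
    return len(ordered)

def fuse(modal_vectors, modal_weights):
    """Weight each reduced modality vector element-wise, then concatenate them."""
    fused = []
    for vector, weights in zip(modal_vectors, modal_weights):
        fused.extend(w * x for w, x in zip(weights, vector))
    return fused

# Toy variances: cumulative shares are 0.60, 0.90, 0.96, 1.00.
k = components_to_keep([6.0, 3.0, 0.6, 0.4])

# Toy reduced vectors for two modalities with their per-feature weights.
fused = fuse([[1.0, 2.0], [3.0]], [[0.5, 0.5], [1.0]])
```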

[0049] In this embodiment, in S4, the corresponding quality grade and sensory score are used as training labels, and the multimodal digital feature vector is used to train the multi-task attention fusion network. As shown in Figure 5, the multi-task attention fusion network includes: a cross-attention interaction module, used to calculate the interaction features between any two modalities through the cross-attention mechanism; a multi-task adaptive aggregation module, used to aggregate the interaction features and generate deep fusion features for the classification and regression tasks; and a task-specific output module, used to process the deep fusion features for the classification and regression tasks and to output quality grade probabilities and sensory scores.

[0050] Furthermore, the data processing procedure of the cross-attention interaction module includes the following. For modalities a and b, the feature vector $F_a$ is mapped to a query vector $Q_a$ through the learnable parameter matrix $W_Q$, and the feature vector $F_b$ is mapped to a key vector $K_b$ and a value vector $V_b$ through the learnable parameter matrices $W_K$ and $W_V$, respectively; here a and b each denote one of the olfactory, gustatory, and visual modalities. The attention weights $A_{ab}$ are calculated from the query vector $Q_a$ and the key vector $K_b$. The attention weights $A_{ab}$ are used to perform weighted fusion of the value matrix $V_b$, giving the initial cross-modal interaction features $\tilde{H}_{ab}$. The initial cross-modal interaction features $\tilde{H}_{ab}$ are processed by a Dropout operation and combined with the feature vector through a residual connection, and the result of the residual connection is layer-normalized to obtain the interaction features $H_{ab}$.
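The forward pass of one cross-attention pair can be sketched in plain Python as below. Several details are assumptions rather than statements from the patent: the attention weights use scaled dot-product with a softmax, each modality is represented as a small matrix of token rows, the residual connection is taken with the query-side features, layer normalization has no learnable gain or bias, and Dropout is treated as the identity (its inference-time behavior). All function names are hypothetical.

```python
import math

def matmul(A, B):
    """Plain matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def layer_norm(row, eps=1e-5):
    """Normalize one row to zero mean and unit variance (no gain or bias)."""
    mean = sum(row) / len(row)
    var = sum((x - mean) ** 2 for x in row) / len(row)
    return [(x - mean) / math.sqrt(var + eps) for x in row]

def cross_attention(F_a, F_b, W_q, W_k, W_v):
    """Modality a queries modality b; returns interaction features shaped like F_a."""
    Q = matmul(F_a, W_q)
    K = matmul(F_b, W_k)
    V = matmul(F_b, W_v)
    d = len(Q[0])
    K_t = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_t)
    # Assumption: scaled dot-product attention weights.
    attn = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    initial = matmul(attn, V)
    # Dropout is the identity at inference time, so it is omitted here.
    # Assumption: the residual connection uses the query-side features F_a.
    return [layer_norm([h + f for h, f in zip(h_row, f_row)])
            for h_row, f_row in zip(initial, F_a)]

identity = [[1.0, 0.0], [0.0, 1.0]]
out = cross_attention([[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0]],
                      identity, identity, identity)
```

With three modalities, this function would be called once per ordered pair, for example olfactory querying gustatory and gustatory querying olfactory.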

[0051] Furthermore, the multi-task adaptive aggregation module includes the following. Classification aggregation unit: each group of interaction features $H_{ab}$ is passed through a fully connected layer for nonlinear transformation and then through a max pooling layer; all max pooling results are concatenated and fused through a fully connected layer to generate the classification deep fusion feature $Z_{cls}$. Regression aggregation unit: each group of interaction features $H_{ab}$ is passed through a fully connected layer for nonlinear transformation and then through an average pooling layer; all average pooling results are concatenated and fused through a fully connected layer to generate the regression deep fusion feature $Z_{reg}$.
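The two pooling paths can be sketched in Python as follows. This is a simplified illustration under stated assumptions: ReLU stands in for the unspecified nonlinearity, pooling runs over the rows of each interaction feature matrix, one set of dense weights is shared by all modality pairs, and the final fusing fully connected layer is left out. The names `dense_relu`, `max_pool`, `avg_pool`, and `aggregate` are hypothetical.

```python
def dense_relu(rows, W, b):
    """Fully connected layer followed by ReLU (assumed nonlinearity)."""
    return [[max(0.0, sum(x * w for x, w in zip(row, col)) + bias)
             for col, bias in zip(zip(*W), b)]
            for row in rows]

def max_pool(rows):
    """Column-wise maximum over all rows."""
    return [max(col) for col in zip(*rows)]

def avg_pool(rows):
    """Column-wise mean over all rows."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def aggregate(interactions, W, b, pool):
    """Transform each interaction feature group, pool it, and concatenate the results."""
    pooled = []
    for H in interactions:
        pooled.extend(pool(dense_relu(H, W, b)))
    return pooled

# Toy interaction features for two modality pairs, identity dense weights.
W = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
interactions = [[[1.0, -2.0], [3.0, 4.0]], [[0.0, 2.0], [2.0, 0.0]]]
cls_input = aggregate(interactions, W, b, max_pool)   # classification path
reg_input = aggregate(interactions, W, b, avg_pool)   # regression path
```

Max pooling keeps the strongest response in each channel, which suits a grade decision, while average pooling keeps an overall level, which suits a continuous score.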

[0052] Furthermore, the task-specific output module includes the following. The classification head consists of two fully connected layers and a Gaussian error linear unit (GELU) activation function; it processes the classification deep fusion feature and outputs the probability distribution of the sample over the superior, first-grade, and second-grade categories. The regression head consists of four parallel scoring units, each comprising two fully connected layers and corresponding to a different sensory score; together they output the color and appearance score, the aroma score, the taste and mouthfeel score, and the style score.

[0053] Furthermore, the training of the multi-task attention fusion network employs a composite loss function $L$:

$$L = \alpha L_{cls} + \beta L_{reg} + \gamma L_{con}$$

[0054] where α, β, and γ are preset weighting coefficients; $L_{cls}$ is the cross-entropy loss, which measures the difference between the predicted probability distribution and the true quality grade label; $L_{reg}$ is the smooth L1 loss, which measures the difference between each predicted sensory score and the actual score label; and $L_{con}$ is the modality consistency constraint loss, computed as the negative cosine similarity between the classification deep fusion features and the regression deep fusion features, which constrains the semantic consistency of the features learned for the different tasks.
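A minimal Python sketch of such a composite loss for a single sample is given below. The function names, the default weights (1.0, 1.0, 0.1), the smooth L1 transition point `delta=1.0`, and the averaging of the regression loss over the four scores are all illustrative assumptions; the patent only states that the three weighting coefficients are preset. The sketch also assumes the two deep fusion features have the same length so that a cosine similarity can be computed.

```python
import math

def cross_entropy(probs, true_index):
    """Negative log-probability assigned to the true quality grade."""
    return -math.log(probs[true_index])

def smooth_l1(pred, target, delta=1.0):
    """Smooth L1 loss averaged over the sensory scores."""
    total = 0.0
    for p, t in zip(pred, target):
        diff = abs(p - t)
        # Quadratic near zero, linear for larger errors.
        total += 0.5 * diff * diff / delta if diff < delta else diff - 0.5 * delta
    return total / len(pred)

def neg_cosine(u, v):
    """Negative cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return -dot / (norm_u * norm_v)

def composite_loss(probs, true_index, scores, true_scores, z_cls, z_reg,
                   alpha=1.0, beta=1.0, gamma=0.1):
    """Weighted sum of classification, regression, and consistency terms."""
    return (alpha * cross_entropy(probs, true_index)
            + beta * smooth_l1(scores, true_scores)
            + gamma * neg_cosine(z_cls, z_reg))

# Toy sample: true grade is index 1, two sensory scores, aligned features.
loss = composite_loss([0.05, 0.90, 0.05], 1, [4.0, 26.0], [4.5, 24.0],
                      [1.0, 0.0], [1.0, 0.0])
```

Because the consistency term is a negative similarity, minimizing the total loss pushes the classification and regression features toward the same direction in feature space.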

[0055] By introducing a cross-attention mechanism, this multi-task attention fusion network actively models and exploits cross-modal associations such as "smell-taste" and "vision-smell", simulating the way sensory information is combined and integrated during human evaluation. This is superior to the traditional approach of simply concatenating features and feeding them directly into a fully connected network. Furthermore, recognizing the differences between the classification and regression tasks, the network uses a differentiated feature aggregation strategy that allows it to extract the most suitable feature representation for each task. Combined with the modality consistency constraint loss, the high-level features learned for the classification and regression tasks are discouraged from contradicting each other in semantic space; this acts as an implicit alignment and improves the model's generalization ability and internal consistency.

[0056] In this embodiment, in S5, the multimodal digital feature vector corresponding to the liquor sample to be tested is input into the trained multi-task attention fusion network to obtain the corresponding evaluation result.

[0057] Its specific implementation steps include: 1) Standardized feature generation and model input; When evaluating the baijiu samples to be tested, the first step is to strictly follow the same process as the model training phase to generate the corresponding multimodal digital feature vectors. Specifically, after equilibrating the test samples under the same environmental conditions, the same sensor array, electronic tongue, and visual acquisition system are used to collect their raw olfactory, gustatory, and visual data according to the same parameters. Subsequently, the same preprocessing algorithm, feature extraction formula, Z-score normalization parameters, and PCA dimensionality reduction model and entropy weighting coefficients determined in the training phase are used to process the raw data of the test samples, and finally generate a multimodal digital feature vector composed of olfactory, gustatory, and visual components, which is used as the input of the multi-task attention fusion network.

[0058] 2) Network forward propagation and result generation. The multimodal digital feature vector of the sample to be tested is input into the trained multi-task attention fusion network, whose parameters are frozen. The network automatically performs forward propagation: first, the cross-attention interaction module calculates the interaction relationships between the sample-specific olfactory, gustatory, and visual features; then, the multi-task adaptive aggregation module generates deep fusion features for the classification and regression tasks through the max pooling and average pooling paths, respectively; finally, the task-specific output module processes these two sets of features in parallel. The network output consists of two sets of explicit quantitative results: a three-dimensional probability distribution vector, such as [0.05, 0.90, 0.05], indicating that the sample is classified as superior, first-grade, and second-grade with probabilities of 5%, 90%, and 5%, respectively; and a four-dimensional score vector, such as [4.2, 25.8, 40.5, 11.3], corresponding to the predicted scores for color and appearance, aroma, taste and mouthfeel, and style, respectively.

[0059] 3) Result interpretation and application output; Based on the quantitative results output by the network, a final evaluation report is generated. The grade with the highest probability is determined as the quality grade of the sample. At the same time, the scores of each sensory dimension intuitively reflect the performance of the sample in specific indicators. For example, an aroma score of 25.8 (out of 30) indicates that its aroma performance is excellent. This process requires no human intervention and can complete an objective and quantitative evaluation of new samples in a short time. The output can be directly used for online quality grading, batch consistency inspection, or product quality benchmarking analysis in liquor production, realizing the transformation from sensory evaluation based on experience to intelligent decision-making based on data models.
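The interpretation step can be sketched in Python using the example outputs quoted in this embodiment. The function name `build_report`, the report layout, and the English grade and dimension labels are illustrative assumptions; only the two example vectors come from the text.

```python
GRADES = ("superior", "first-grade", "second-grade")
DIMENSIONS = ("color and appearance", "aroma", "taste and mouthfeel", "style")

def build_report(probabilities, scores):
    """Pick the most probable grade and pair each score with its sensory dimension."""
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return {
        "grade": GRADES[best],
        "grade_probability": probabilities[best],
        "scores": dict(zip(DIMENSIONS, scores)),
    }

# Example network outputs taken from the embodiment.
report = build_report([0.05, 0.90, 0.05], [4.2, 25.8, 40.5, 11.3])
```

For these example values the report assigns the first-grade label with a probability of 0.90 and lists an aroma score of 25.8.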

[0060] The various embodiments in this specification are described in a progressive manner, with each embodiment focusing on its differences from other embodiments. Similar or identical parts between embodiments can be referred to interchangeably. For the systems disclosed in the embodiments, since they correspond to the methods disclosed in the embodiments, the descriptions are relatively simple; relevant parts can be referred to the method section.

[0061] The above description of the disclosed embodiments enables those skilled in the art to make or use the invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the invention. Therefore, the invention is not to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A method for machine-based baijiu (Chinese liquor) tasting based on multimodal fusion and deep learning, characterized in that it includes the following steps: S1. Based on multiple batches of baijiu samples, obtain the corresponding olfactory time-series data, gustatory current time-series data, and visual image data; S2. Preprocess the olfactory time-series data, gustatory current time-series data, and visual image data, and extract olfactory feature vectors, gustatory feature vectors, and visual feature vectors from the preprocessed data; S3. Perform dimensionality reduction on the olfactory feature vector, gustatory feature vector, and visual feature vector, and perform weighted fusion on the dimensionality-reduced features to generate a multimodal digital feature vector; S4. Using the corresponding quality grade and sensory score as training labels, train a multi-task attention fusion network using the multimodal digital feature vector; S5. Input the multimodal digital feature vector corresponding to the baijiu sample to be tested into the trained multi-task attention fusion network to obtain the corresponding evaluation results.

2. The method for machine tasting of baijiu (Chinese liquor) based on multimodal fusion and deep learning according to claim 1, characterized in that S1 includes: using a metal oxide semiconductor sensor array to collect volatile gases from a baijiu sample to obtain multi-channel olfactory time-series data; applying a multi-step pulse voltage excitation to the baijiu sample using a voltammetric electronic tongue and collecting the current response of the working electrode to obtain gustatory current time-series data; and, under standardized optical conditions, using an industrial camera to photograph the baijiu sample placed in a transparent container to obtain visual image data.

3. The method for machine tasting of baijiu (Chinese liquor) based on multimodal fusion and deep learning according to claim 1, characterized in that, in step S2, the preprocessing includes: denoising the olfactory time-series data using the Savitzky-Golay smoothing algorithm; denoising the gustatory current time-series data using a wavelet thresholding algorithm; and denoising the visual image data using a median filtering algorithm.

4. The method for machine tasting of baijiu (Chinese liquor) based on multimodal fusion and deep learning according to claim 1, characterized in that, in step S2, extracting the olfactory feature vector includes: for each independent channel of the olfactory time-series data, extracting time-domain features including the mean, maximum, variance, integral, maximum difference, and mean absolute value derivative, and concatenating the time-domain features of all channels in channel order to form the olfactory feature vector; extracting the gustatory feature vector includes: extracting the steady-state response current, cumulative charge, and redox symmetry ratio features from the current response of each pulse excitation cycle, and concatenating the features of all pulse excitation cycles in cycle order to form the gustatory feature vector; and extracting the visual feature vector includes: extracting the mean yellowness, color uniformity, relative transmittance, image information entropy, and suspended matter ratio features from the image, and concatenating all features in a preset order to form the visual feature vector.

5. The method for machine tasting of baijiu (Chinese liquor) based on multimodal fusion and deep learning according to claim 1, characterized in that S3 includes: using principal component analysis (PCA) to reduce the dimensionality of the olfactory, gustatory, and visual feature vectors, retaining the principal components whose cumulative variance contribution rate is not lower than a preset threshold; using the entropy weight method to calculate the weight coefficient of each dimensionality-reduced modal feature and weighting the features accordingly; and concatenating the weighted feature vectors of each modality to generate a multimodal digital feature vector.

6. The method for machine tasting of baijiu (Chinese liquor) based on multimodal fusion and deep learning according to claim 5, characterized in that the entropy weight method is used to calculate the weight coefficient of each dimensionality-reduced modal feature, and the weight coefficient $w_j$ of the j-th feature is expressed as: $$w_j = \frac{d_j}{\sum_{k=1}^{p} d_k}, \qquad d_j = 1 - e_j, \qquad e_j = -\frac{1}{\ln m}\sum_{i=1}^{m} p_{ij}\ln p_{ij}$$ where $d_j$ and $d_k$ are the information utility values of the j-th and k-th features, respectively; $e_j$ is the information entropy of the j-th feature; $p_{ij}$ is the proportion of the i-th sample on the j-th feature; m is the number of samples; and p is the number of features.

7. The method for machine tasting of baijiu based on multimodal fusion and deep learning according to claim 1, characterized in that, in S4, the multi-task attention fusion network includes: a cross-attention interaction module, used to calculate the interaction features between any two modalities through the cross-attention mechanism; a multi-task adaptive aggregation module, used to aggregate the interaction features and generate deep fusion features for the classification and regression tasks; and a task-specific output module, used to process the deep fusion features for the classification and regression tasks and to output quality grade probabilities and sensory scores.

8. The method for machine tasting of baijiu based on multimodal fusion and deep learning according to claim 1, characterized in that the data processing procedure of the cross-attention interaction module includes: for modalities a and b, mapping the feature vector $F_a$ to a query vector $Q_a$ through the learnable parameter matrix $W_Q$, and mapping the feature vector $F_b$ to a key vector $K_b$ and a value vector $V_b$ through the learnable parameter matrices $W_K$ and $W_V$, respectively; calculating the attention weights $A_{ab}$ from the query vector $Q_a$ and the key vector $K_b$; using the attention weights $A_{ab}$ to perform weighted fusion of the value matrix $V_b$ to obtain the initial cross-modal interaction features $\tilde{H}_{ab}$; and processing the initial cross-modal interaction features $\tilde{H}_{ab}$ by a Dropout operation, combining them with the feature vector through a residual connection, and layer-normalizing the result of the residual connection to obtain the interaction features $H_{ab}$.

9. The method for machine tasting of baijiu (Chinese liquor) based on multimodal fusion and deep learning according to claim 1, characterized in that the multi-task adaptive aggregation module includes: a classification aggregation unit, in which each group of interaction features $H_{ab}$ is passed through a fully connected layer for nonlinear transformation and then through a max pooling layer, and all max pooling results are concatenated and fused through a fully connected layer to generate the classification deep fusion feature $Z_{cls}$; and a regression aggregation unit, in which each group of interaction features $H_{ab}$ is passed through a fully connected layer for nonlinear transformation and then through an average pooling layer, and all average pooling results are concatenated and fused through a fully connected layer to generate the regression deep fusion feature $Z_{reg}$.

10. The method for machine tasting of baijiu (Chinese liquor) based on multimodal fusion and deep learning according to claim 1, characterized in that the task-specific output module includes: a classification head, consisting of two fully connected layers and a Gaussian error linear unit (GELU) activation function, which processes the classification deep fusion feature and outputs the probability distribution of the sample over the superior, first-grade, and second-grade categories; and a regression head, consisting of four parallel scoring units, each comprising two fully connected layers and corresponding to a different sensory score, which output the color and appearance score, the aroma score, the taste and mouthfeel score, and the style score.