Artificial intelligence-based beverage recommendation method and system
By combining cross-modal dynamic alignment and a dual-branch feature extraction network with a gated cross-attention mechanism and a dynamic adjustment strategy, the problems of missing semantic associations between modalities and fixed fusion weights in beverage recommendation systems are solved, achieving more accurate and personalized beverage recommendations.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- XI AN PENGPAIYUEDONG ELECTRONIC TECH CO LTD
- Filing Date
- 2026-03-17
- Publication Date
- 2026-06-19
Abstract
Description
Technical Field
[0001] This invention relates to the field of image recognition or understanding, and specifically to a beverage recommendation method and system based on artificial intelligence.
Background Technology
[0002] With the development of artificial intelligence technology, especially in the fields of computer vision and deep learning, image processing and analysis have been widely applied to various recommendation systems. However, in the field of beverage recommendation, accurately capturing users' personalized needs and providing recommendations that match their taste and health requirements remains a challenging task. Traditional beverage recommendation systems mainly rely on users' personal preferences, historical data, and basic characteristics such as age and gender for simple recommendations. Although this method can provide some personalized recommendations in certain situations, its accuracy and personalization are still insufficient. Especially when facing complex scenarios such as different types of beverages and differences in users' physical conditions, existing methods often fail to fully consider subtle differences among users, thus affecting the accuracy and practicality of the recommendation results.
[0003] Chinese invention patent CN111563195A discloses a beverage recommendation method, device, and computer-readable storage medium. The method includes: acquiring a user's beverage data within a preset time period; determining the user's preference for various beverages based on the beverage data; determining the user's preferred beverages based on those preferences; and generating a recommendation list containing the beverages to be recommended and their transaction links based on the user's preferred beverages. When facial and tongue images are used simultaneously for beverage recommendation, such prior art often employs simple feature concatenation or independent processing to fuse multimodal data, making it difficult to capture deep semantic relationships between different modalities. Furthermore, traditional single-network structures cannot simultaneously consider the global illumination features of facial images and the local texture details of tongue images. In addition, existing solutions typically have fixed fusion weights for different modal features and cannot dynamically adjust to how strongly each beverage category depends on each feature, resulting in insufficient accuracy and personalization of the recommendation results.
[0004] Therefore, existing technologies still have significant limitations when facing complex application scenarios. First, when multiple data sources are introduced, such as simultaneously collecting facial and tongue images of users, existing methods typically use simple feature concatenation or independent processing for multimodal fusion, which struggles to capture the deep semantic relationships between different modalities, resulting in insufficient feature representation capabilities after fusion. Second, facial images are easily affected by factors such as lighting conditions and shooting angles, while tongue images contain a wealth of subtle texture and color distribution information. Traditional single-network structures struggle to simultaneously address the extraction needs of both global features and local details. Furthermore, existing solutions often use fixed or preset weights for the fusion of features from different modalities and cannot dynamically adjust the weights according to how strongly different beverage categories depend on each feature, thus affecting the accuracy and personalization of the recommendation results.
Summary of the Invention
[0005] This application provides an artificial intelligence-based beverage recommendation method and system. This method effectively solves the problem of missing semantic associations between modalities in traditional methods, providing semantically consistent input features for subsequent fusion.
[0006] In a first aspect, the present invention provides a beverage recommendation method based on artificial intelligence, comprising the following steps: a facial image and a tongue image of the user are collected; the facial image and the tongue image are subjected to cross-modal dynamic alignment processing to obtain aligned cross-modal fusion features; the aligned cross-modal fusion features are input into a dual-branch feature extraction network, which includes a face branch and a tongue branch, wherein the face branch performs multi-scale global feature extraction on the cross-modal fusion features to obtain facial depth features, and the tongue branch performs local texture feature extraction on the cross-modal fusion features to obtain tongue coating depth features; the face branch uses multi-scale dilated convolution to extract illumination-invariant features, and the tongue branch uses group convolution and a channel attention mechanism to focus on fine local texture features; a gated cross-attention mechanism is used to dynamically interact and fuse the facial depth features and the tongue coating depth features to generate user fusion features; and based on the user fusion features, the corresponding beverage recommendation results are generated and output.
[0007] As a further improvement of the present invention, the cross-modal dynamic alignment processing of the facial image and the tongue image includes: extracting the convolutional features of the facial image and the gradient features of the tongue image, and performing a channel-wise cross-correlation operation on the two to obtain the first intermediate feature; extracting the convolutional features of the tongue image and the gradient features of the facial image, and performing a Hadamard product operation on the two to obtain the second intermediate feature; and dynamically generating weight coefficients based on the content of the facial image and the tongue image, and performing weighted fusion of the first intermediate feature and the second intermediate feature based on the weight coefficients to obtain the aligned cross-modal fusion feature.
As a further improvement of the present invention, the aligned cross-modal fusion feature is calculated in the following manner:
[0008] In the formula, I_face is the input facial image; F_align is the aligned cross-modal fusion feature tensor; ℝ denotes the dimensional space of the data features; H_f and W_f are the length and width of the facial image in pixels; ReLU is the rectified linear unit activation function; I_tongue is the input tongue coating image; H_t and W_t are the length and width of the tongue coating image in pixels; ∇ is the Sobel gradient operator, used to capture the texture characteristics of tongue cracks and facial edges; Conv_3×3 is a feature extraction layer with a 3×3 convolution kernel; ⋆ denotes the channel-wise cross-correlation operation; ⊙ denotes the Hadamard product operation; γ and 1 − γ are the dynamic weight coefficients; MLP is a multilayer perceptron; and GAP is the global average pooling operation.
[0009] As a further improvement of the present invention, the step of dynamically generating weight coefficients includes: generating the weight coefficients using a multilayer perceptron based on the global pooling features of the facial image and the tongue image.
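For reference, the alignment described in the preceding paragraphs can be written compactly as follows. This is a reconstruction from the textual definitions only, not the original formula: the symbol names, the sigmoid on the perceptron output (taken from the detailed embodiment), and the placement of the ReLU activation around the weighted sum are all assumptions.

```latex
\begin{aligned}
\gamma &= \sigma\Big(\mathrm{MLP}\big(\big[\mathrm{GAP}(I_{face});\ \mathrm{GAP}(I_{tongue})\big]\big)\Big) \\
F_{align} &= \mathrm{ReLU}\Big(\gamma \cdot \big(\mathrm{Conv}_{3\times 3}(I_{face}) \star \nabla I_{tongue}\big)
  + (1-\gamma) \cdot \big(\mathrm{Conv}_{3\times 3}(I_{tongue}) \odot \nabla I_{face}\big)\Big)
\end{aligned}
```

Here the first bracketed term is the first intermediate feature (channel-wise cross-correlation) and the second is the second intermediate feature (Hadamard product).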
[0010] As a further improvement of the present invention, the facial branch uses multi-scale dilated convolution to perform parallel convolution on the input features with multiple different dilation rates, and fuses the convolution results to extract global illumination invariant features containing different receptive fields. The tongue coating branch employs a grouped convolution and channel attention mechanism. It extracts local texture information of the input features through grouped convolution and assigns weights to different channels of the feature map after grouped convolution through a channel attention layer to focus on key texture features.
[0011] As a further improvement of the present invention, the multi-scale dilated convolution used in the facial branch extracts features in the following manner:
[0012] In the formula, DConv_r is the dilated (atrous) convolution with dilation rate r, where the different dilation rates capture local, mid-range, and global illumination-invariant features respectively; F_face^(l) is the output feature of the facial branch at the l-th layer of the convolutional neural network; X^(l) is the input feature of the l-th layer of the convolutional neural network after cross-modal alignment; BN is the batch normalization operation, used to align the distribution of the multi-scale features and alleviate the shift in feature distribution under different lighting conditions; and W_r^face is the weight parameter of the dilated convolution kernel with dilation rate r in the facial branch.
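The idea of parallel dilated convolutions can be sketched in a few lines of plain Python. This is a minimal single-channel illustration under stated assumptions, not the network of this application: the dilation rates 1, 2 and 4, the summation used to fuse the parallel responses, and the whole-map standardisation standing in for batch normalization are all hypothetical choices, since the text does not fix them.

```python
import math

def dilated_conv2d(image, kernel, rate):
    """Zero-padded 3x3 cross-correlation whose taps are spaced `rate` pixels apart."""
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            acc = 0.0
            for ky in range(3):
                for kx in range(3):
                    sy = y + (ky - 1) * rate
                    sx = x + (kx - 1) * rate
                    if 0 <= sy < height and 0 <= sx < width:
                        acc += kernel[ky][kx] * image[sy][sx]
            out[y][x] = acc
    return out

def multi_scale_branch(image, kernels_by_rate, eps=1e-5):
    """Sum the parallel dilated responses (assumed fusion), then standardise the map."""
    height, width = len(image), len(image[0])
    fused = [[0.0] * width for _ in range(height)]
    for rate, kernel in kernels_by_rate.items():
        response = dilated_conv2d(image, kernel, rate)
        for y in range(height):
            for x in range(width):
                fused[y][x] += response[y][x]
    # Stand-in for batch normalization: zero mean, unit variance over the map.
    values = [v for row in fused for v in row]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    scale = 1.0 / math.sqrt(var + eps)
    return [[(v - mean) * scale for v in row] for row in fused]

# Toy 5x5 intensity ramp and three parallel branches (rates are assumptions).
image = [[float(y * 5 + x) for x in range(5)] for y in range(5)]
ones = [[1.0] * 3 for _ in range(3)]
features = multi_scale_branch(image, {1: ones, 2: ones, 4: ones})
```

A larger rate widens the receptive field without adding kernel weights, which is why the small rate responds to local structure and the large rate to near-global structure.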
[0013] As a further improvement of the present invention, the tongue coating branch extracts features in the following manner:
[0014] In the formula, F_tongue^(l) is the output feature of the tongue coating branch at the l-th layer of the convolutional neural network; GConv is the grouped convolution; SE is the squeeze-and-excitation module; and W_g^tongue is the weight parameter of the grouped convolution kernel in the tongue coating branch.
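The combination of grouped convolution and channel attention can be illustrated as below. This is a hedged sketch rather than the tongue coating branch itself: the grouped convolution is reduced to a 1x1 (pointwise) form for brevity, biases are omitted, and the function names and toy weights are hypothetical.

```python
import math

def grouped_pointwise_conv(channels, group_weights):
    """Grouped 1x1 convolution: channels are mixed only inside their own group."""
    groups = len(group_weights)
    size = len(channels) // groups
    out = []
    for g, weights in enumerate(group_weights):
        block = channels[g * size:(g + 1) * size]
        for row in weights:  # one weight row per output channel of this group
            out.append([sum(w * ch[p] for w, ch in zip(row, block))
                        for p in range(len(block[0]))])
    return out

def squeeze_excite(channels, w_reduce, w_expand):
    """Squeeze-and-excitation: pool each channel, compute a gate, rescale the channel."""
    squeezed = [sum(ch) / len(ch) for ch in channels]
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w_reduce]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w_expand]
    return [[g * v for v in ch] for g, ch in zip(gates, channels)], gates

# Four channels with three spatial positions each, split into two groups.
channels = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0], [2.0, 2.0, 2.0], [4.0, 0.0, 4.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
swap = [[0.0, 1.0], [1.0, 0.0]]
mixed = grouped_pointwise_conv(channels, [identity, swap])
attended, gates = squeeze_excite(mixed,
                                 w_reduce=[[0.25, 0.25, 0.25, 0.25]],
                                 w_expand=[[1.0], [0.0], [-1.0], [0.0]])
```

Because each group only sees its own channels, information from different color channels is not mixed at this stage, while the gates let the network amplify or suppress whole channels.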
[0015] As a further improvement to the present invention, the use of a gated cross-attention mechanism for dynamic interaction fusion includes: The facial depth features are projected into a query vector, and the tongue coating depth features are projected into a key vector and a value vector to calculate the attention weight of the face to the tongue coating. The tongue coating depth features are projected into a query vector, and the facial depth features are projected into a key vector and a value vector to calculate the attention weight of the tongue coating on the face. Based on the attention weight of the face to the tongue coating and the attention weight of the tongue coating to the face, the value vectors of the facial depth features and the tongue coating depth features are weighted and fused to generate the user fusion features.
[0016] As a further improvement of the present invention, the user fusion feature is calculated in the following manner:
[0017]
[0018]
[0019] In the formula, Q_f is the projection of F_face by W_Q^f, representing the query vector of the facial branch; K_t is the projection of F_tongue by W_K^t, representing the key vector of the tongue coating branch; V_t is the projection of F_tongue by W_V^t, representing the value vector of the tongue coating branch; V_f is the projection of F_face by W_V^f, representing the value vector of the facial branch; F_fuse is the feature resulting from dynamic interaction and fusion; LN is the layer normalization operation; W_K^t is the projection weight matrix of the tongue coating branch key vector; W_V^t is the projection weight matrix of the tongue coating branch value vector; W_Q^f is the projection weight matrix of the facial branch query vector; W_V^f is the projection weight matrix of the facial branch value vector; F_face is the final feature of the facial branch; F_tongue is the final feature of the tongue coating branch; A_f→t is the cross-modal attention weight matrix, which gives the attention weights of the face on the tongue coating (for example, if the face is pale, the blood color feature of the tongue coating is needed for assistance); A_t→f is the inverse cross-modal attention weight matrix, which gives the attention weights of the tongue coating on the face (when the tongue coating image is blurred, A_t→f automatically reduces the contribution of the tongue coating branch and relies on the facial branch); Softmax is the Softmax function; √d is the scaling factor; FFN is a feedforward network; and [· ; ·] denotes feature concatenation of the two attended features, which enhances the combination of beverage-related characteristics (for example, if the complexion is dull and the tongue coating is yellow and greasy, herbal teas are recommended).
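The bidirectional attention described above can be sketched as follows. This is an illustration under assumptions, not the formula of this application: each direction gets its own query, key and value projections as in paragraph [0015], the attended features are mean-pooled before concatenation, and the feedforward network and the gate are left out because their exact form is not given in the text. All function and parameter names are hypothetical.

```python
import math

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax_rows(matrix):
    out = []
    for row in matrix:
        top = max(row)
        exps = [math.exp(v - top) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def cross_attend(query_feats, context_feats, w_q, w_k, w_v):
    """Scaled dot-product attention of one modality (queries) over the other."""
    q = matmul(query_feats, w_q)
    k = matmul(context_feats, w_k)
    v = matmul(context_feats, w_v)
    scale = math.sqrt(len(q[0]))
    k_transposed = [list(col) for col in zip(*k)]
    scores = [[s / scale for s in row] for row in matmul(q, k_transposed)]
    weights = softmax_rows(scores)
    return matmul(weights, v), weights

def mean_pool(feats):
    return [sum(col) / len(feats) for col in zip(*feats)]

def layer_norm(vec, eps=1e-5):
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def fuse_features(face, tongue, p):
    """Face attends to tongue, tongue attends to face, then concatenate and normalise."""
    face_ctx, a_ft = cross_attend(face, tongue, p["wq_f"], p["wk_t"], p["wv_t"])
    tongue_ctx, a_tf = cross_attend(tongue, face, p["wq_t"], p["wk_f"], p["wv_f"])
    fused = mean_pool(face_ctx) + mean_pool(tongue_ctx)  # list concatenation
    return layer_norm(fused), a_ft, a_tf

eye = [[1.0, 0.0], [0.0, 1.0]]
params = {name: eye for name in ("wq_f", "wk_t", "wv_t", "wq_t", "wk_f", "wv_f")}
face = [[1.0, 0.0], [0.0, 1.0]]                 # two facial feature tokens
tongue = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # three tongue feature tokens
fused, a_ft, a_tf = fuse_features(face, tongue, params)
```

Each row of `a_ft` is a probability distribution over tongue tokens, so a facial token can draw more heavily on whichever tongue features match it best, and `a_tf` does the same in the opposite direction.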
[0020] As a further improvement of the present invention, when training the dual-branch feature extraction network, a decoupling loss function is used to optimize the dual-branch feature extraction network. The decoupling loss function is used to constrain the identity features of different samples of the same user to be consistent, while simultaneously constraining the beverage preference features of different users to be far apart.
[0021] As a further improvement of the present invention, the decoupling loss function is expressed as:
[0022] In the formula, L_dec is the decoupling loss function; E denotes the expectation; the first term constrains the consistency of identity features: individual differences in facial images should not affect beverage recommendation results, so the identity encodings of different samples of the same user (such as faces under different lighting conditions) are forced to be similar; the second term, through the negative L2 distance, maximizes the difference in beverage features between different samples, forcing the network to separate the two types of features, maximizing the distance between the beverage features of different users, and strengthening the features related to physical constitution; F_i and F_j are the fusion features of different augmented samples of the same user; ‖·‖_2 is the L2 norm; Enc_id(·) is the identity encoder, consisting of two fully connected layers and LeakyReLU; and Enc_drink(·) is the beverage encoder, with the same structure as the identity encoder but with independent parameters.
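A toy version of such a loss is sketched below. It is one possible reading of the description, not the formula of this application: the identity term is averaged over same-user pairs and the beverage term over different-user pairs, the two terms are weighted equally, and the encoders are replaced by trivial slicing functions. The names `decoupling_loss`, `enc_id` and `enc_drink` are hypothetical.

```python
import math

def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def decoupling_loss(same_user_pairs, cross_user_pairs, enc_id, enc_drink):
    """Mean identity distance on same-user pairs minus mean beverage distance on cross-user pairs."""
    id_term = sum(l2_distance(enc_id(a), enc_id(b))
                  for a, b in same_user_pairs) / len(same_user_pairs)
    drink_term = sum(l2_distance(enc_drink(a), enc_drink(b))
                     for a, b in cross_user_pairs) / len(cross_user_pairs)
    return id_term - drink_term

# Stand-in encoders: the first two entries play the identity code,
# the last two play the beverage code (purely illustrative).
enc_id = lambda feat: feat[:2]
enc_drink = lambda feat: feat[2:]

user_a_view1 = [1.0, 1.0, 0.0, 0.0]
user_a_view2 = [1.0, 1.0, 0.5, 0.0]   # same user, different lighting
user_b_view1 = [0.0, 0.0, 3.0, 4.0]   # a different user

loss = decoupling_loss([(user_a_view1, user_a_view2)],
                       [(user_a_view1, user_b_view1)],
                       enc_id, enc_drink)
```

Minimising this quantity pulls identity codes of the same user together while pushing beverage codes of different users apart. Since the negative distance term is unbounded below, a practical implementation would likely need a margin or clipping, which the text does not specify.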
[0023] As a further improvement of the present invention, when training the dual-branch feature extraction network, an orthogonal constraint loss function is used to force the feature space used to represent identity and the feature space used to represent beverage preferences to be orthogonal to each other. The orthogonal constraint loss function is expressed as follows:
[0024] In the formula, L_orth is the orthogonal constraint loss function; ‖·‖_F is the Frobenius norm, under which the identity and beverage feature spaces are forced to be orthogonal so that identity-related variables (such as facial scars) do not contaminate the beverage features; Z_id is the batch identity feature matrix; d is the feature dimension; B is the number of data items in the batch; and Z_drink is the batch beverage feature matrix.
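A common way to express such a constraint is the squared Frobenius norm of the cross-product of the two batch feature matrices, sketched below. This form is an assumption consistent with the symbols listed above, not a confirmed transcription of the original formula, and any normalisation by batch size or feature dimension is omitted.

```python
def orthogonality_loss(z_id, z_drink):
    """Squared Frobenius norm of z_id^T z_drink for two B x d feature matrices."""
    total = 0.0
    for id_col in zip(*z_id):            # iterate over feature dimensions
        for drink_col in zip(*z_drink):
            total += sum(a * b for a, b in zip(id_col, drink_col)) ** 2
    return total

# Batch of two samples with two-dimensional identity and beverage features.
z_id = [[1.0, 0.0], [1.0, 0.0]]
z_drink = [[1.0, 0.0], [-1.0, 0.0]]
penalty = orthogonality_loss(z_id, z_drink)
```

The penalty is zero exactly when every identity feature dimension is uncorrelated (in the dot-product sense) with every beverage feature dimension over the batch, and grows as the two subspaces overlap.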
[0025] As a further improvement of the present invention, when training the dual-branch feature extraction network, a classification loss function is used to optimize the dual-branch feature extraction network. The classification loss function introduces a temperature coefficient dynamically generated based on the input user fusion features, which is used to soften the classification probability distribution when the sample features are blurred.
[0026] As a further improvement to the present invention, the classification loss function is expressed as:
[0027] In the formula, z_i is the logit value of the i-th category; z_j is the logit value of the j-th category; L_cls is the classification loss function; C is the total number of beverage categories; y_i is the one-hot encoded label of the i-th category; τ is the temperature coefficient, dynamically generated based on the input features: when the sample features are blurred (e.g., due to lighting or occlusion), τ is increased to soften the probability distribution, and conversely it is decreased to enhance classification confidence; and exp is the exponential function with the natural constant as its base.
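The effect of the temperature on the class distribution can be seen in a short sketch of temperature-scaled cross-entropy. The temperature is passed in as a plain number here, because the text does not specify the sub-network that generates it from the input features; the function names are hypothetical.

```python
import math

def log_softmax_with_temperature(logits, tau):
    """Numerically stable log-softmax of logits divided by temperature tau."""
    scaled = [z / tau for z in logits]
    top = max(scaled)
    log_total = math.log(sum(math.exp(s - top) for s in scaled))
    return [s - top - log_total for s in scaled]

def classification_loss(logits, one_hot, tau):
    """Cross-entropy between the one-hot label and the temperature-scaled softmax."""
    log_probs = log_softmax_with_temperature(logits, tau)
    return -sum(y * lp for y, lp in zip(one_hot, log_probs))

logits = [2.0, 1.0, 0.0]           # e.g. black tea, green tea, herbal tea
label = [1.0, 0.0, 0.0]
sharp = [math.exp(lp) for lp in log_softmax_with_temperature(logits, 0.5)]
soft = [math.exp(lp) for lp in log_softmax_with_temperature(logits, 2.0)]
loss_clear = classification_loss(logits, label, 0.5)
loss_blurry = classification_loss(logits, label, 2.0)
```

With the larger temperature the probabilities move toward uniform, so a blurred sample is not forced into an over-confident prediction; with the smaller temperature the top class dominates.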
[0028] In a second aspect, the present invention provides an artificial intelligence-based beverage recommendation system for executing the aforementioned artificial intelligence-based beverage recommendation method, comprising: an image acquisition module, used to acquire images of the user's face and tongue; a cross-modal alignment module, used to perform cross-modal dynamic alignment processing on the facial image and the tongue image to obtain aligned cross-modal fusion features; a dual-branch feature extraction module, including a face branch and a tongue branch, wherein the face branch is used to perform multi-scale global feature extraction on the cross-modal fusion features to obtain facial depth features, and the tongue branch is used to perform local texture feature extraction on the cross-modal fusion features to obtain tongue coating depth features; a dynamic feature interaction fusion module, used to dynamically interact and fuse the facial depth features and the tongue coating depth features using a gated cross-attention mechanism to generate user fusion features; and a recommendation output module, used to generate and output corresponding beverage recommendation results based on the user fusion features.
[0029] In a third aspect, the present invention provides a training method for an artificial intelligence-based beverage recommendation model, comprising the following steps: labeled facial image samples and tongue coating image samples are obtained, wherein the labels include corresponding beverage category tags; the facial image samples and the tongue coating image samples are subjected to cross-modal dynamic alignment processing to obtain aligned cross-modal fusion features; the aligned cross-modal fusion features are input into a dual-branch feature extraction network, which includes a face branch and a tongue branch, wherein the face branch is used to perform multi-scale global feature extraction on the cross-modal fusion features to obtain facial depth features, and the tongue branch is used to perform local texture feature extraction on the cross-modal fusion features to obtain tongue coating depth features; a gated cross-attention mechanism is used to dynamically interact and fuse the facial depth features and the tongue coating depth features to generate user fusion features; the total loss is calculated using a decoupling loss function and a classification loss function, and the parameters of the dual-branch feature extraction network are updated based on the total loss, wherein the decoupling loss function is used to constrain the identity features of different samples of the same user to be consistent while constraining the beverage preference features of different users to be far apart, and the classification loss function is used to constrain the difference between the user fusion features and the beverage category label; and the iteration is repeated until the preset stopping condition is met to obtain the trained beverage recommendation model.
[0030] In a fourth aspect, this application provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the aforementioned artificial intelligence-based beverage recommendation method.
[0031] In a fifth aspect, this application provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the aforementioned artificial intelligence-based beverage recommendation method.
[0032] In a sixth aspect, this application provides a computer program product, which includes computer instructions that instruct a computer to execute the aforementioned artificial intelligence-based beverage recommendation method.
[0033] The beneficial effects of the technical solution proposed in this application are: This application introduces a cross-modal dynamic alignment strategy, utilizing gradient operators and dynamic weight coefficients to adaptively align facial and tongue images at the feature level. This effectively addresses the lack of semantic association between modalities in traditional methods, providing semantically consistent input features for subsequent fusion. This application constructs a dual-branch feature extraction network. The facial branch uses multi-scale dilated convolution to expand the receptive field and extract illumination-invariant features, while the tongue branch uses group convolution combined with channel attention to focus on local fine textures. This overcomes the limitation of a single network in simultaneously handling global and local feature extraction, significantly improving the comprehensiveness and accuracy of feature extraction. This application employs a gated cross-attention mechanism to achieve dynamic interactive fusion of facial and tongue features. It can adaptively adjust the fusion weights between modalities based on the input content, enabling accurate feature combination and enhancement even when different beverage categories have varying degrees of dependence on facial or tongue features. Furthermore, this application introduces decoupling loss functions, dynamic temperature coefficient classification loss, and gradient-sensitive weight decay strategies when training the dual-branch feature extraction network, further enhancing the model's robustness to individual differences and its ability to distinguish easily confused beverage categories. In summary, this application significantly improves the accuracy and personalization of beverage recommendations.
Attached Figure Description
[0034] Figure 1 is a flowchart of the AI-based beverage recommendation method provided in this application;
Figure 2 is a schematic diagram of the AI-based beverage recommendation system provided in this application;
Figure 3 is a flowchart of the training method for the AI-based beverage recommendation model provided in this application;
Figure 4 shows the comparative experimental results, provided in this application, of the impact of different cross-modal alignment strategies on model performance;
Figure 5 shows the performance comparison, provided in this application, of different network architectures in terms of global semantic capture and local detail extraction;
Figure 6 shows the experimental results, provided in this application, comparing the classification performance of the dynamic temperature coefficient and fixed temperature coefficient strategies on samples of different quality;
Figure 7 is a visualization of the experimental results, provided in this application, of the feature decoupling loss function on the separation of the identity feature space and the beverage feature space;
Figure 8 is a schematic diagram of an electronic device provided in this application.
Detailed Implementation
[0035] The embodiments of this application are described in detail below. Examples of the embodiments are shown in the accompanying drawings, wherein the same or similar reference numerals denote the same or similar elements or elements having the same or similar functions throughout. The embodiments described below with reference to the accompanying drawings are exemplary and are only used to explain this application, and should not be construed as limiting this application. The step numbers in the following embodiments are set only for ease of explanation, and there is no limitation on the order between the steps. The execution order of each step in the embodiments can be adaptively adjusted according to the understanding of those skilled in the art.
[0036] In the description of this application, unless otherwise expressly defined, terms such as "setup," "installation," and "connection" should be interpreted broadly, and those skilled in the art can reasonably determine the specific meaning of the above terms in this application in conjunction with the specific content of the technical solution.
Example 1
[0039] In existing beverage recommendation technologies, when facial and tongue images are used simultaneously as recommendation criteria, simple feature concatenation or independent processing methods are often employed for multimodal fusion. This approach fails to capture the deep semantic relationships between different modalities, such as the physiological correlation between facial skin color and tongue color, resulting in insufficient feature representation capabilities after fusion and affecting the accuracy of the recommendation results.
[0040] To solve this problem, as shown in Figure 1, this application proposes an artificial intelligence-based beverage recommendation method, including the following steps: First, images of the user's face and tongue are acquired using an image acquisition device.
[0041] In one embodiment, facial images are captured using a high-definition camera under controlled lighting conditions to ensure the collection of diverse samples under different lighting conditions; tongue images are captured using a high-resolution macro camera with precise control of the shooting angle and distance to ensure the clarity of the tongue coating texture and color distribution.
[0042] Next, cross-modal dynamic alignment processing is performed on the facial image and the tongue image to obtain the aligned cross-modal fusion features.
[0043] This step aims to address the semantic gap between the two modalities in the feature space, such as aligning skin color information in a facial image with tongue color information in a tongue image at the feature level.
[0044] Then, the aligned cross-modal fusion features are input into a dual-branch feature extraction network. This network contains two parallel but structurally heterogeneous branches: a face branch and a tongue branch. The face branch employs a multi-scale dilated convolutional structure, extracting features in parallel through convolutional kernels with different dilation rates. This effectively captures global skin color features of facial images under different lighting conditions, eliminating interference from local illumination abrupt changes. The tongue branch uses group convolution combined with a channel attention mechanism. Group convolution groups input features by channel and performs independent convolution, avoiding information confusion between color channels. The channel attention mechanism assigns higher weights to key texture channels (such as channels reflecting tongue cracks and thickness), thereby focusing on the local subtle texture features of the tongue. Through these two branches, facial depth features and tongue depth features are obtained, respectively.
[0045] Subsequently, a gated cross-attention mechanism is employed to dynamically interact and fuse these two deep features. This mechanism calculates the attention weights of facial features on tongue coating features and vice versa, achieving bidirectional information interaction and enhancement to generate a user fusion feature that comprehensively reflects the user's physical condition. Finally, this user fusion feature is input into a classifier, which outputs corresponding beverage recommendations, such as recommending black tea, green tea, or specific health-promoting teas. Through these steps, this method can fully utilize the complementary information of facial and tongue coating images to achieve accurate personalized beverage recommendations.
[0046] Furthermore, in cross-modal dynamic alignment processing, simple linear transformations or feature concatenation are insufficient to establish complex nonlinear semantic relationships between face and tongue images. Therefore, this application specifies the steps for cross-modal dynamic alignment.
[0047] The specific implementation method for this step is as follows: First, a 3×3 convolution operation is performed on the facial image to extract its convolutional features, and the Sobel gradient operator is applied to the tongue image to extract its gradient features. Then, a channel-wise cross-correlation operation is performed between the convolutional features of the facial image and the gradient features of the tongue image to obtain the first intermediate feature. The channel-wise cross-correlation operation can enhance the matching degree of the two modalities in local structure, for example, associating the lip region in the facial image with the tongue tip region in the tongue image at the feature level.
[0048] Simultaneously, a 3×3 convolution operation is performed on the tongue image to extract its convolution features, and the Sobel gradient operator is applied to the facial image to extract its gradient features. Then, a Hadamard product operation is performed between the convolution features of the tongue image and the gradient features of the facial image to obtain the second intermediate feature. The Hadamard product operation can effectively suppress the interference of irrelevant regions such as highlights in the facial image on the tongue features. Next, a weight coefficient γ is dynamically generated based on the global content of the facial and tongue images.
[0049] Specifically, the facial image and tongue coating image are each subjected to global average pooling. The resulting two global feature vectors are concatenated and input into a multilayer perceptron. The output of the multilayer perceptron is mapped to a range of 0 to 1 using a sigmoid activation function, thus obtaining the weight coefficient γ. Finally, the first and second intermediate features are weighted and fused based on this weight coefficient, i.e., F_align = γ × (first intermediate feature) + (1 − γ) × (second intermediate feature), to obtain the aligned cross-modal fusion feature. When the tongue image has severe reflection causing unclear texture, the dynamic weight coefficient γ will automatically increase, making the fusion result more dependent on the face image branch, and vice versa. This alignment step, by introducing gradient information and dynamic weights, achieves adaptive semantic alignment between the face and tongue images at the feature level, providing semantically consistent high-quality input for subsequent feature extraction.
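The weight generation and weighted fusion just described can be sketched as follows. This is a minimal illustration under assumptions: the multilayer perceptron is reduced to one hidden ReLU layer without biases, the two intermediate features are taken as given single-channel maps, and all names and weight values are hypothetical.

```python
import math

def global_average_pool(image):
    """Mean intensity of each channel; `image` is a list of channels (2D lists)."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in image]

def dynamic_weight(face, tongue, w_hidden, w_out):
    """gamma = sigmoid(MLP([GAP(face); GAP(tongue)])) with a one-hidden-layer MLP."""
    pooled = global_average_pool(face) + global_average_pool(tongue)  # concatenation
    hidden = [max(0.0, sum(w * p for w, p in zip(row, pooled))) for row in w_hidden]
    logit = sum(w * h for w, h in zip(w_out, hidden))
    return 1.0 / (1.0 + math.exp(-logit))

def blend(first, second, gamma):
    """F_align = gamma * first + (1 - gamma) * second, element by element."""
    return [[gamma * a + (1.0 - gamma) * b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(first, second)]

# Single-channel 2x2 toy images.
face = [[[0.8, 0.6], [0.4, 0.2]]]
tongue = [[[0.1, 0.1], [0.1, 0.1]]]
w_hidden = [[1.0, -1.0]]                 # hidden unit responds to face minus tongue
gamma = dynamic_weight(face, tongue, w_hidden, w_out=[2.0])

first_intermediate = [[2.0, 4.0]]
second_intermediate = [[0.0, 0.0]]
aligned = blend(first_intermediate, second_intermediate, gamma)
```

Because γ is computed from the pooled content of both images, the same network can lean on the first intermediate feature for one input pair and on the second for another, instead of using a fixed fusion weight.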
[0050] In beverage recommendation tasks, individual identity features (such as innate skin color, facial scars, wrinkles, etc.) and genuine beverage preference features (such as changes in tongue coating and complexion related to physical constitution, etc.) are often mixed together. If the model cannot effectively distinguish between these two types of features, individual differences (such as skin color differences between different ethnic groups) may interfere with the recommendation results, for example, recommending a specific tea drink simply because the person has a darker skin tone, rather than based on their actual physical constitution. To address this issue, this application introduces a decoupling loss function during the training phase.
[0051] Specifically, during model training, in addition to the conventional classification loss, a decoupling loss function is introduced to optimize the entire dual-branch feature extraction network and its subsequent modules. This decoupling loss function has two design goals: first, to constrain the identity features of different samples from the same user to be consistent; and second, to constrain the beverage preference features of different users to be far apart. In the implementation, for each input image, the model not only outputs the user fusion feature F_fuse but also uses two independent encoders to extract identity features and beverage preference features respectively. The identity encoder, Enc_id(·), consists of two fully connected layers and a LeakyReLU activation function and is used to extract feature vectors related to individual identity from F_fuse. The beverage encoder, Enc_drink(·), has the same network structure as the identity encoder, but its parameters are completely independent, and it is used to extract feature vectors related to beverage preferences. During training, for multiple samples collected from the same user under different conditions (such as different lighting and different angles), the decoupling loss function forces the identity feature vectors that these samples produce through the identity encoder to be as close as possible, i.e., it constrains the consistency of identity features.
[0052] Meanwhile, for different users, the decoupling loss function forces their beverage feature vectors obtained through the beverage encoder to be as far apart as possible, thus maximizing the discriminative power of beverage features. In this way, the model is guided during the learning process to separate individual identity information and beverage preference information into two different feature subspaces, thereby effectively eliminating the interference of identity differences on beverage recommendations and making the recommendation results truly based on features related to physical constitution.
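The two goals of the decoupling loss described above can be illustrated with a minimal sketch. It assumes the identity and beverage feature vectors have already been produced by Enc_id(·) and Enc_drink(·); the function names, the unsquared L2 distance, and the equal weighting of the two terms are illustrative assumptions, not details taken from this application.

```python
import math

def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def decoupling_loss(id_pairs_same_user, drink_pairs_diff_users):
    """Hypothetical decoupling loss.

    id_pairs_same_user: pairs of identity vectors from two samples of one user
        (pulled together, so their distance is added to the loss).
    drink_pairs_diff_users: pairs of beverage vectors from two different users
        (pushed apart, so their distance is subtracted from the loss).
    """
    pull = sum(l2_distance(a, b) for a, b in id_pairs_same_user) / len(id_pairs_same_user)
    push = sum(l2_distance(a, b) for a, b in drink_pairs_diff_users) / len(drink_pairs_diff_users)
    return pull - push

# Identical identity vectors and well-separated beverage vectors give a low loss.
id_pairs = [([1.0, 0.0], [1.0, 0.0])]
drink_pairs = [([0.0, 0.0], [3.0, 4.0])]
print(decoupling_loss(id_pairs, drink_pairs))
```

Minimizing this quantity lowers the identity distance within a user while raising the beverage distance across users. An unbounded negative term can dominate training, so a practical variant might cap it with a margin.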
[0053] Training the dual-branch feature extraction network involves three loss functions. First, the network is optimized using a decoupling loss function, which constrains the identity features of different samples of the same user to be consistent while simultaneously constraining the beverage preference features of different users to be far apart. Second, an orthogonal constraint loss function forces the feature space used to represent identity and the feature space used to represent beverage preferences to be orthogonal to each other. Third, the network is optimized using a classification loss function, which introduces a temperature coefficient that is dynamically generated based on the input user fusion features and is used to soften the classification probability distribution when the sample features are ambiguous.
[0054] In beverage recommendation tasks, different beverage categories may exhibit high similarity. For example, different flavors of juice or different types of tea may have very similar user characteristics, easily leading to classification confusion. Furthermore, in practical applications, input images may suffer from quality issues such as blurriness, occlusion, and uneven lighting, making the extracted features less clear and further increasing the classification difficulty. To address this problem, this application introduces a dynamic temperature coefficient into the classification loss function during the training phase.
[0055] Specifically, during model training, the classification loss function used is the Softmax cross-entropy loss with a temperature coefficient τ. Unlike traditional fixed temperature coefficients, the temperature coefficient τ in this application is dynamically generated based on the user fusion feature F_fuse for each input sample. Specifically, the user fusion feature F_fuse is first subjected to global average pooling to obtain a one-dimensional feature vector. This vector is then input into a multilayer perceptron (MLP), and the output of the MLP is the temperature coefficient τ = MLP(AvgPool(F_fuse)). A multilayer perceptron typically consists of two fully connected layers and a non-linear activation function, whose parameters are optimized along with the entire network during training. The dynamically generated temperature coefficient τ adjusts the smoothness of the probability distribution output by the Softmax function. When the input sample features are blurry or of poor quality, such as when the image is occluded or unevenly lit, the model has lower confidence in classifying that sample, and in this case, the MLP will output a larger τ value. A large temperature coefficient makes the output probability distribution of the Softmax function "softer," meaning the probability differences between categories decrease, and the model will not make overly confident judgments about fuzzy samples too early, which helps to stabilize learning in the early stages of training.
[0056] Conversely, when the input samples have clear features and are easy to classify, the multilayer perceptron correspondingly outputs a smaller τ value, making the probability distribution of the Softmax output sharper and enhancing the model's ability to discriminate easily classified samples. Through this adaptive adjustment mechanism, the model can better handle low-quality samples in complex scenarios, improving both its robustness to the varied sample distributions of real-world environments and its classification accuracy.
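The effect of the temperature coefficient can be illustrated with a minimal sketch of τ = MLP(AvgPool(F_fuse)) followed by a temperature-scaled Softmax. The ReLU hidden activation, the softplus output that keeps τ positive, and all function names and weights are assumptions made for illustration; the application only states that the MLP has two fully connected layers and a non-linear activation.

```python
import math

def global_average_pool(feature_map):
    """feature_map: list of channels, each a flat list of activations."""
    return [sum(channel) / len(channel) for channel in feature_map]

def temperature_mlp(pooled, w1, b1, w2, b2):
    """Two-layer MLP mapping pooled features to a scalar temperature.

    ReLU in the hidden layer and softplus on the output are assumptions;
    softplus plus a small floor keeps the temperature strictly positive.
    """
    hidden = [max(0.0, sum(w * x for w, x in zip(row, pooled)) + b)
              for row, b in zip(w1, b1)]
    raw = sum(w * h for w, h in zip(w2, hidden)) + b2
    return math.log1p(math.exp(raw)) + 1e-3

def softmax_with_temperature(logits, tau):
    """Softmax over logits divided by tau (numerically stabilized)."""
    scaled = [z / tau for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target_index):
    """Cross-entropy for a one-hot label at target_index."""
    return -math.log(probs[target_index])

logits = [2.0, 1.0, 0.0]
print(softmax_with_temperature(logits, 0.5))  # small tau: sharper distribution
print(softmax_with_temperature(logits, 5.0))  # large tau: softer distribution
```

With the same logits, the larger temperature yields a flatter distribution, which is the softening behavior described for ambiguous samples.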
[0057] Example 2: This invention proposes an artificial intelligence-based beverage recommendation system to implement the beverage recommendation method of Example 1. As shown in Figure 2, the system specifically includes the following functional modules.
[0058] The image acquisition module is used to acquire images of the user's face and tongue. In one embodiment, the device can be integrated into a smart tea machine, including a high-definition RGB camera for capturing facial images and a high-resolution camera with macro capabilities for capturing tongue images. The cameras have autofocus and adaptive exposure adjustment functions to cope with different lighting environments and shooting distances, ensuring the acquisition of clear and standard images.
[0059] The cross-modal alignment module is used to perform cross-modal dynamic alignment processing on the acquired facial and tongue images to obtain aligned cross-modal fusion features. This module can be implemented in hardware using a GPU or a dedicated AI acceleration chip, and runs program code that implements the alignment algorithm in software.
[0060] The dual-branch feature extraction module includes a face branch and a tongue coating branch, which are used to extract facial depth features and tongue coating depth features from the aligned fused features, respectively. The face branch runs the algorithm implementing the multi-scale dilated convolution, while the tongue coating branch runs the algorithm implementing the group convolution and channel attention mechanism. The dynamic feature interaction fusion module uses a gated cross-attention mechanism to dynamically interact and fuse facial depth features and tongue coating depth features to generate user-fused features. This module runs the algorithm implementing the attention mechanism.
[0061] The recommendation output module uses a classifier (e.g., a fully connected layer with a Softmax function) to calculate, from the generated user fusion features, the probability that the user belongs to each beverage category, and outputs the one or more beverages with the highest probability as recommendations. These recommendations can be displayed on a screen or sent to downstream execution modules (e.g., an automated beverage maker) for automatic ingredient dispensing and preparation.
[0062] Through the collaborative work of the above modules, this system can convert facial and tongue images into accurate beverage recommendations, achieving a fully automated process from data collection to result output. Detailed explanations are as follows:
S100, Data Acquisition and Labeling Module. The data acquisition sources are multimodal data comprising facial images and tongue coating images. The facial image data comes from professional facial image acquisition equipment, which uses high-definition cameras to capture static images, with diversity under different lighting conditions ensured during acquisition. Tongue coating image data is acquired through a high-resolution camera. To capture texture and color distribution in particular, the shooting angle and distance of the tongue coating need to be precisely controlled to ensure the clarity and detail of the image. During acquisition, the equipment needs automatic focusing and adaptive exposure adjustment functions to cope with the impact of lighting changes on image quality.
[0063] Data annotation relies on a combination of professional medical personnel and image processing technology. The annotation of facial images is mainly based on skin color features, facial expressions, and other visual information related to beverage recommendations. Annotators manually identify and annotate facial regions using image annotation tools. The annotation of tongue coating images focuses on features such as color distribution and texture details. Annotators classify and annotate based on subtle changes in the thickness, cracks, and color differences of the tongue coating.
[0064] In one embodiment, the labeled categories include different categories of beverages.
[0065] S200, Beverage Recommendation Model Training: This invention employs a deep multi-feature fusion convolutional neural network as the beverage recommendation model for facial and tongue images. As shown in Figure 3, the training process is as follows:
S201. Obtain labeled facial image samples and tongue coating image samples, wherein the labeling includes corresponding beverage category labels;
S202. Perform cross-modal dynamic alignment processing on the facial image sample and the tongue coating image sample to obtain the aligned cross-modal fusion features;
S203. Input the aligned cross-modal fusion features into a dual-branch feature extraction network. The dual-branch feature extraction network includes a face branch and a tongue branch. Use the face branch to perform multi-scale global feature extraction on the cross-modal fusion features to obtain facial depth features, and use the tongue branch to perform local texture feature extraction on the cross-modal fusion features to obtain tongue depth features;
S204. Use a gated cross-attention mechanism to dynamically interact and fuse the facial depth features and the tongue coating depth features to generate user fusion features;
S205. Calculate the total loss using a decoupling loss function and a classification loss function, and update the parameters of the dual-branch feature extraction network based on the total loss; wherein the decoupling loss function is used to constrain the identity features of different samples of the same user to be consistent while simultaneously constraining the beverage preference features of different users to be far apart, and the classification loss function is used to constrain the difference between the user fusion features and the beverage category label;
S206. Repeat the iteration until the preset stopping condition is met to obtain the trained beverage recommendation model.
[0066] The following is a detailed explanation of the specific training process for the beverage recommendation model:
1. Cross-modal alignment preprocessing of face and tongue images.
Facial images are sensitive to lighting and vary with facial expression, while tongue images have a dense color distribution and fine texture. Conventional feature fusion methods process the two types of images independently or simply concatenate their features, resulting in a lack of semantic association between modalities. This invention employs a cross-modal dynamic alignment strategy, achieving alignment through feature space projection and local gradient fusion. This allows for adaptive adjustment of the cross-modal fusion ratio based on image content, resolving the problem of missing modal associations. This is expressed as:
[0067] In the formula, the terms denote, in order: the input facial image; the aligned cross-modal fusion feature tensor; the feature dimension of the data; the length and width of the facial image in pixels; the rectified linear unit (ReLU) activation function; the input tongue coating image; the length and width of the tongue coating image in pixels; the Sobel gradient operator, which extracts local texture features (such as tongue cracks or facial edges); a feature extraction layer with a 3x3 convolutional kernel; the channel-wise cross-correlation operation; the Hadamard product; and the dynamic weighting coefficient.
Tongue images have dense colors but significant texture differences (such as yellow cracks), so gradient operations are required to capture subtle changes. The Sobel gradient operator enhances local structural features by responding to the texture characteristics of tongue cracks (high gradient response) and facial edges (stable gradient). The channel-wise cross-correlation enhances cross-modal local feature matching by computing the channel correlation between the facial convolutional features and the tongue gradient features, which matches related regions (such as the correlation between lip color and tongue color). The Hadamard product suppresses responses from irrelevant regions, in particular interference from facial lighting artifacts (such as specular reflection) on the tongue coating features. The weighting coefficient is dynamic: when the tongue coating image shows severe reflection, the weight of the tongue coating branch is reduced. Its calculation involves the Sigmoid activation function, a concatenation operation, a multilayer perceptron (containing two fully connected layers and an activation function), and a global average pooling operation.
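As an illustration of the gradient step only, the following minimal sketch applies the standard 3x3 Sobel kernels to a small grayscale image and returns the gradient magnitude at interior pixels. The function name and the choice to skip border pixels are assumptions; the alignment projection, cross-correlation, and dynamic weighting are not shown.

```python
import math

# Standard Sobel kernels for horizontal and vertical intensity changes.
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]

def sobel_magnitude(image):
    """Gradient magnitude of a 2D grayscale image (list of rows).

    Only interior pixels are computed, so the output is two rows and
    two columns smaller than the input.
    """
    height, width = len(image), len(image[0])
    result = []
    for i in range(1, height - 1):
        row = []
        for j in range(1, width - 1):
            gx = sum(SOBEL_X[a][b] * image[i - 1 + a][j - 1 + b]
                     for a in range(3) for b in range(3))
            gy = sum(SOBEL_Y[a][b] * image[i - 1 + a][j - 1 + b]
                     for a in range(3) for b in range(3))
            row.append(math.hypot(gx, gy))
        result.append(row)
    return result

# A vertical edge (dark left half, bright right half) gives a strong response.
edge = [[0, 0, 1, 1] for _ in range(4)]
print(sobel_magnitude(edge))
```

A sharp transition such as a tongue crack produces a large magnitude, while a uniformly colored region produces zero, which is why the operator suits dense-color, high-texture tongue images.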
[0068] 2. Perform multi-scale bi-branch feature extraction.
[0069] Changes in facial lighting (such as differences between indoor and outdoor lighting) can interfere with global features (such as skin color assessment). Conventional single-branch convolutional neural networks struggle to simultaneously capture global facial semantics (such as skin color) and local details of the tongue coating (such as tongue coating granules). This invention constructs a dual-branch feature extraction network with a heterogeneous dual-branch convolution structure, comprising a face branch and a tongue branch. The face branch uses dilated convolution to expand the receptive field and extract illumination-invariant features. Multi-scale dilated convolution aggregates global facial illumination features, and the expanded receptive field eliminates the influence of abrupt local illumination changes (such as facial shadows caused by sidelight). This is represented as:
[0070] In the formula, the terms denote: the dilated convolution with a given dilation rate; the dilation rate itself, whose three values capture local (pore details), mid-range (facial region segmentation), and global (overall skin tone) illumination-invariant features; the output features of the facial branch at a given layer of the convolutional neural network; the input features of that layer after cross-modal alignment; the batch normalization operation, which aligns the distribution of the multi-scale features and alleviates shifts in feature distribution under different lighting conditions; and the weight parameters of the dilated convolution kernel for each dilation rate in the facial branch. The three dilated convolutions of the facial branch are computed in parallel, and their output feature maps are summed and then batch normalized.
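The parallel-then-sum structure of the face branch can be sketched in one dimension as follows. The dilation rates (1, 2, 4), the zero padding, and the simplified normalization over positions (standing in for batch normalization, without learned scale and shift) are assumptions for illustration, as are the function names.

```python
import math

def dilated_conv1d(signal, kernel, rate):
    """Zero-padded 1D dilated convolution that preserves the input length.

    Kernel taps are spaced `rate` positions apart, so a larger rate widens
    the receptive field without adding parameters.
    """
    center = len(kernel) // 2
    output = []
    for i in range(len(signal)):
        acc = 0.0
        for t, weight in enumerate(kernel):
            idx = i + (t - center) * rate
            if 0 <= idx < len(signal):
                acc += weight * signal[idx]
        output.append(acc)
    return output

def normalize(values, eps=1e-5):
    """Zero-mean, unit-variance normalization (simplified stand-in for batch norm)."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def face_branch_layer(signal, kernels, rates=(1, 2, 4)):
    """Run three dilated convolutions in parallel, sum them, then normalize."""
    branches = [dilated_conv1d(signal, k, r) for k, r in zip(kernels, rates)]
    summed = [sum(values) for values in zip(*branches)]
    return normalize(summed)

signal = [1.0, 2.0, 3.0, 4.0, 5.0]
kernels = [[1.0, 0.0, 1.0]] * 3
print(face_branch_layer(signal, kernels))
```

Each branch sees the same input at a different spacing, so the summed response mixes fine and coarse context before normalization.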
[0071] Furthermore, subtle differences in tongue coating texture (such as the granular differences between thin white coating and thick greasy coating) require high-resolution feature extraction. The tongue coating branch uses group convolution and channel attention mechanisms to focus on local textures, and highlights important tongue coating textures by weighting channels, as shown below:
[0072] In the formula, the terms denote: the grouped convolution; the number of groups; the squeeze-and-excitation module; and the weight parameters of the grouped convolution kernel in the tongue coating branch. The tongue coating has a dense color distribution (small differences in RGB values). Grouping avoids channel confusion: each group independently extracts specific color channels (e.g., the R channel is sensitive to yellow coating), which enhances the independence of local textures. The number of groups is chosen to reduce the number of parameters and enhance the independence of local features. The squeeze-and-excitation mechanism increases the weight of key texture channels (such as crack areas); for example, a higher response is given to the red channel of a yellow greasy coating, while irrelevant tongue background areas are suppressed. Specifically, a squeeze operation is performed first, in which global average pooling compresses the feature map according to a compression ratio. An excitation operation is then performed, generating channel weights through two fully connected layers, both of which use the Sigmoid activation function.
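The squeeze and excitation steps can be sketched as follows. Both fully connected layers use Sigmoid, as stated above; the function names, the tiny example weights, and the compression from two channels to one hidden unit are assumptions for illustration. The grouped convolution that precedes this step is not shown.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dense(inputs, weights, biases, activation):
    """Fully connected layer: one output per weight row."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def squeeze_excite(feature_map, w1, b1, w2, b2):
    """Channel attention over a feature map given as a list of flat channels.

    Squeeze: global average pooling reduces each channel to one number.
    Excitation: two Sigmoid-activated dense layers produce a weight per channel.
    Returns the reweighted feature map and the channel weights.
    """
    squeezed = [sum(channel) / len(channel) for channel in feature_map]
    hidden = dense(squeezed, w1, b1, sigmoid)      # C -> C / ratio
    weights = dense(hidden, w2, b2, sigmoid)       # C / ratio -> C
    scaled = [[wt * v for v in channel] for wt, channel in zip(weights, feature_map)]
    return scaled, weights

# Two channels: the first carries texture energy, the second is empty background.
fmap = [[1.0, 3.0], [0.0, 0.0]]
scaled, weights = squeeze_excite(fmap, [[1.0, 1.0]], [0.0], [[2.0], [-2.0]], [0.0, 0.0])
print(weights)
```

The per-channel weights lie between 0 and 1, so informative channels are retained while weak channels are attenuated rather than removed outright.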
[0073] 3. Dynamic feature interaction fusion.
Different beverage categories have different modal dependencies; for example, juice recommendations depend on facial skin tone, while traditional Chinese medicine teas depend on tongue coating texture. Traditional feature fusion methods use fixed weights to add or combine features, which cannot adapt to the differences in modal dependence among different beverage categories. This invention employs a gated cross-attention mechanism to achieve modal complementarity through bidirectional attention, and the gated mechanism can dynamically adjust the fusion direction according to the input, as shown below:
[0074]
[0075]
[0076] In the formula, the terms denote: the query vector of the facial branch; the key vector of the tongue coating branch; the value vector of the tongue coating branch; the value vector of the facial branch; the features resulting from dynamic interaction and fusion; the layer normalization operation; the projection weight matrices of the tongue coating branch key vector, the tongue coating branch value vector, the facial branch query vector, and the facial branch value vector; the final features of the facial branch; the final features of the tongue coating branch; the cross-modal attention weight matrix; the inverse cross-modal attention weight matrix; the Softmax function; the scaling factor; the feedforward network; and the feature concatenation operation.
The cross-modal attention weight matrix gives the attention weights of the face on the tongue coating; for example, if the face is pale, the blood color feature of the tongue coating is needed for assistance. The inverse cross-modal attention weight matrix gives the attention weights of the tongue coating on the face; when the tongue coating image is blurred, it automatically reduces the contribution of the tongue coating branch so that the result relies on the facial branch. Feature concatenation of the results from the two attention directions enhances the combination of beverage-related characteristics; for example, if the complexion is dull and the tongue coating is yellow and greasy, herbal teas are recommended.
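One attention direction (facial queries attending to tongue coating keys and values) can be sketched with standard scaled dot-product attention. This is only a building block under assumed names and shapes: the gating, layer normalization, feedforward network, and concatenation steps are omitted, and the use of the square root of the key dimension as the scaling factor is an assumption.

```python
import math

def transpose(matrix):
    return [list(row) for row in zip(*matrix)]

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax_rows(matrix):
    """Row-wise Softmax, numerically stabilized."""
    result = []
    for row in matrix:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        result.append([e / total for e in exps])
    return result

def cross_attention(queries, keys, values):
    """Queries from one modality attend over keys/values of the other modality."""
    dim = len(keys[0])
    scores = matmul(queries, transpose(keys))
    scaled = [[s / math.sqrt(dim) for s in row] for row in scores]
    attention = softmax_rows(scaled)
    return matmul(attention, values), attention

# One facial query attending over two tongue coating positions.
face_q = [[1.0, 0.0]]
tongue_k = [[1.0, 0.0], [0.0, 1.0]]
tongue_v = [[10.0, 0.0], [0.0, 10.0]]
attended, attention = cross_attention(face_q, tongue_k, tongue_v)
print(attention, attended)
```

Swapping the roles of the two modalities gives the inverse direction, and a learned gate would then decide how much of each direction enters the fused feature.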
[0077] 4. Calculate the decoupling loss function.
[0078] In beverage recommendation tasks, individual identity characteristics (such as innate skin color) and beverage preference characteristics (such as physical constitution-related needs) need to be separated. To eliminate the interference of individual differences (such as differences in facial skin color) on beverage recommendations, a decoupling loss function is used to separate identity-related features from beverage preference features. The loss function is calculated as follows:
[0079] In the formula, the terms denote: the decoupling loss function; the expectation; the fusion features of different augmented samples of the same user; the L2 norm; the identity encoder, consisting of two fully connected layers and LeakyReLU; and the beverage encoder, which has the same structure as the identity encoder but independent parameters. One term constrains the consistency of identity features: individual differences in facial images should not affect beverage recommendation results, so the identity encodings of different samples of the same user (such as faces under different lighting conditions) are forced to be similar. The other term, a negative distance, maximizes the difference in beverage features between different samples, forcing the network to separate the two types of features, maximizing the distance between the beverage features of different users, and strengthening constitution-related features (such as a damp-heat constitution corresponding to specific beverages).
5. Calculate the classification loss function.
[0080] For beverage categories with high similarity (such as juice and tea), a learnable temperature coefficient is used to adjust the logits distribution to calculate the classification loss function, expressed as:
[0081] In the formula, the terms denote: the logit value of each category, referenced under two category indices; the classification loss function; the total number of beverage categories; the one-hot encoded label of each category; the temperature coefficient τ, which is dynamically generated based on the input features; and the exponential function with the natural constant as its base. When the sample features are ambiguous (e.g., due to occlusion or uneven lighting), τ is increased to soften the probability distribution; conversely, it is decreased to enhance classification confidence.
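For reference, a temperature-scaled Softmax cross-entropy consistent with the terms listed above can be written as follows. The symbols z_i (logit of category i), C (number of beverage categories), y_i (one-hot label), and L_cls are notation assumed here for illustration; only τ = MLP(AvgPool(F_fuse)) is stated explicitly in this application.

```latex
p_i = \frac{\exp(z_i / \tau)}{\sum_{j=1}^{C} \exp(z_j / \tau)}, \qquad
L_{cls} = -\sum_{i=1}^{C} y_i \log p_i, \qquad
\tau = \mathrm{MLP}\big(\mathrm{AvgPool}(F_{fuse})\big)
```

As τ grows, the ratios between the exp terms shrink toward 1 and the distribution flattens; as τ approaches 0, the largest logit dominates.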
[0082] 6. Perform gradient-guided weight decay.
[0083] Traditional weight decay uniformly penalizes all parameters, leading to over-constraint of important feature layers. This invention employs a gradient-sensitive decay coefficient to apply stronger regularization to layers with large gradients (such as shallow convolutions), suppressing overfitting, and reducing constraints on deep classifiers while preserving discriminative ability. This is expressed as:
[0084] In the formula, the terms denote: the loss function of the convolutional neural network, calculated as the weighted sum of the decoupling loss and the classification loss; the gradient-sensitive weight decay coefficient of a given layer of the convolutional neural network; the base decay rate; the gradient of the loss function with respect to the weight parameters of that layer; the number of convolutional layers in the convolutional neural network; and the gradient of the loss function with respect to the parameters of each layer. The gradient-sensitive coefficient avoids the fixed decay coefficient assumed by traditional optimizers such as Adam: strong decay is applied to shallow convolutions (such as edge detection layers) to prevent overfitting to texture noise (such as facial hair), while weak decay is applied to deep classification layers to preserve beverage discrimination ability.
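One way to realize a gradient-sensitive decay coefficient is sketched below. The specific form, base decay multiplied by the ratio of a layer's gradient norm to the mean gradient norm across layers, is an assumption chosen to match the listed terms (base decay rate, per-layer gradient, number of layers); it is not confirmed as the formula of this application. The function names and the plain gradient step are likewise illustrative.

```python
import math

def l2_norm(vector):
    return math.sqrt(sum(x * x for x in vector))

def layer_decay_coefficients(layer_grads, base_decay):
    """Assumed form: base_decay * ||g_l|| / mean_k ||g_k||.

    Layers whose gradients are larger than average receive stronger decay;
    layers with smaller gradients receive weaker decay.
    """
    norms = [l2_norm(g) for g in layer_grads]
    mean_norm = sum(norms) / len(norms)
    if mean_norm == 0.0:
        return [base_decay] * len(norms)
    return [base_decay * n / mean_norm for n in norms]

def decayed_step(weights, grads, decay, learning_rate):
    """Plain gradient step with per-layer L2 weight decay (illustrative only)."""
    return [w - learning_rate * (g + decay * w) for w, g in zip(weights, grads)]

# A shallow layer with a large gradient and a deep layer with a small one.
grads = [[3.0, 4.0], [0.0, 1.0]]
coeffs = layer_decay_coefficients(grads, base_decay=0.01)
print(coeffs)
```

Because the coefficients are normalized by the mean gradient norm, their average stays at the base decay rate, so the overall regularization strength is redistributed across layers rather than increased.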
[0085] 7. Update the parameters of the convolutional neural network.
[0086] A gradient-sensitive adaptive weight decay strategy is adopted, and the parameter update formula is as follows:
[0087] In the formula, the terms denote: the learning rate of the convolutional neural network; the weight parameters of a given layer of the convolutional neural network; the parameter update operation; and the gradient of the orthogonal constraint loss function with respect to the weight parameters of that layer.
[0088] 8. Optimize multi-task learning.
[0089] The weights of the decoupling loss and classification loss are dynamically adjusted to avoid feature confusion in the early stages of training, as follows:
[0090]
[0091] In the formula, the terms denote: the decoupling loss weight at a given iteration; the classification loss weight; the weight scheduling slope factor, which controls the weight growth rate and is preferably set to 0.3; and the total number of iterations.
[0092] Furthermore, the computation method of a convolutional neural network is as follows: .
[0093] 9. Perform feature decoupling orthogonality constraints.
[0094] To enhance the independence of identity features and beverage features, the two types of features are further decoupled by minimizing the correlation between their feature spaces. The orthogonality loss function is calculated as follows:
[0095] In the formula, the terms denote: the orthogonal constraint loss function; the Frobenius norm; the batch identity feature matrix; the feature dimension; the number of samples in the batch; and the batch beverage feature matrix. The identity and beverage feature spaces are forced to be orthogonal to prevent identity-related variables (such as facial scars) from contaminating the beverage features.
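A common way to express such a constraint is the squared Frobenius norm of the cross-correlation between the two batch feature matrices, sketched below. The exact form, the absence of any normalization by batch size or feature dimension, and the function names are assumptions for illustration.

```python
def cross_gram(id_feats, drink_feats):
    """Compute Z_id^T Z_drink for two batch matrices given as lists of rows.

    id_feats has shape (batch, d_id) and drink_feats has shape (batch, d_drink);
    the result has shape (d_id, d_drink).
    """
    batch = len(id_feats)
    d_id, d_drink = len(id_feats[0]), len(drink_feats[0])
    return [[sum(id_feats[n][i] * drink_feats[n][j] for n in range(batch))
             for j in range(d_drink)]
            for i in range(d_id)]

def orthogonality_loss(id_feats, drink_feats):
    """Squared Frobenius norm of the cross-Gram matrix (assumed form)."""
    gram = cross_gram(id_feats, drink_feats)
    return sum(value * value for row in gram for value in row)

# Uncorrelated feature columns give zero loss; identical columns do not.
print(orthogonality_loss([[1.0, 0.0], [1.0, 0.0]], [[0.0, 2.0], [0.0, -2.0]]))
print(orthogonality_loss([[1.0], [2.0]], [[1.0], [2.0]]))
```

The loss reaches zero only when every identity feature column is orthogonal to every beverage feature column over the batch, which is the separation this constraint aims for.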
[0096] 10. Repeat the above steps iteratively until a preset stopping iteration condition is met, indicating that the model training is complete and a beverage recommendation model is obtained. In one embodiment, the preset stopping iteration condition is reaching a preset maximum number of iterations; preferably, the preset maximum number of iterations is set to 1000.
[0097] S300, Beverage Recommendation: In the beverage recommendation inference stage, facial and tongue images are first preprocessed and then input into the dual-branch feature extraction network (convolutional neural network) for multi-scale feature extraction and dynamic feature interaction fusion, ultimately generating the user's beverage recommendation features. Because the model has been trained with the decoupling loss function and the classification loss function, the recommendation results accurately reflect individual differences and beverage preferences. During inference, based on the user's characteristics (such as skin color, tongue texture, and body type), the system automatically generates a beverage category suitable for the user and outputs the corresponding recommendation results.
[0098] In one embodiment, the following verification is performed to verify the effectiveness of this technology.
[0099] To analyze the impact of different cross-modal alignment strategies on model performance and verify the superiority of dynamic alignment strategies in establishing semantic associations between modalities, this technique is compared with traditional feature concatenation and independent processing methods. The effectiveness of the strategies is analyzed by examining the accuracy changes during model training. The experimental results are shown in Figure 4. The results show that the technique using dynamic weight adjustment can reach a stable convergence state faster and the final accuracy is significantly higher than that of traditional methods. This indicates that the adaptive alignment mechanism of feature space projection and local gradient fusion effectively solves the problem of missing modal associations and enhances the semantic consistency of cross-modal features.
[0100] To analyze multi-scale feature extraction structures and compare different network architectures in global semantic capture and local detail extraction, the heterogeneous dual-branch structure of this technique is compared with single-branch convolutional neural networks and conventional dual-branch models. The evaluation covers metrics such as global feature score, local feature score, and inference speed. As shown in Figure 5, the results demonstrate that this application achieves the best results in both global illumination feature modeling and local texture focusing by combining dilated convolution with channel attention, while keeping the inference speed within an acceptable range. This verifies the promoting effect of heterogeneous structure design on multi-scale feature fusion.
[0101] To analyze the effect of the dynamic temperature coefficient on classification performance, the fixed temperature coefficient strategy is compared with the adaptive adjustment method of this technique, and the robustness of the model on samples of different quality is analyzed by simulating two scenarios: clear samples and blurred samples. As shown in Figure 6, the experimental results demonstrate that this technique significantly improves classification accuracy on blurry samples, indicating that the learnable temperature coefficient can dynamically soften the probability distribution according to feature complexity, effectively alleviating the problem of decreased classification confidence caused by image occlusion or uneven lighting, and enhancing the model's adaptability to complex samples in real-world scenarios.
[0102] To verify the actual effect of the feature decoupling loss function, the degree of spatial separation between identity features and beverage features is analyzed by comparison with a joint loss training strategy. As shown in Figure 7, the experiment uses kernel density estimation to visualize the difference in the distributions of the two types of feature distances. It is found that this technique can significantly reduce the feature distance between different samples with the same identity, while expanding the feature difference between different beverage categories. This proves that the decoupling loss, by constraining the consistency of identity features and maximizing the distinguishability of beverage features, successfully removes the interference of individual differences on the recommendation results, and improves the purity and discriminative ability of beverage preference features.
[0103] Furthermore, this invention demonstrates that the proposed cross-modal dynamic alignment strategy aligns facial and tongue images through feature space projection and local gradient fusion, adaptively adjusting the cross-modal fusion ratio. This effectively solves the problem of missing modal correlations between facial and tongue images. Compared to traditional feature stitching or independent processing methods, this invention's dynamic alignment strategy more accurately captures semantic relationships between modalities, improving data fusion performance. Traditional single-branch convolutional neural networks struggle to simultaneously capture global features of facial images and local details of tongue images. Therefore, this invention designs a dual-branch convolutional neural network. The facial branch uses dilated convolution to expand the receptive field and capture illumination-invariant features; the tongue branch uses group convolution and channel attention mechanisms to focus on local texture features of the tongue. This multi-scale, multi-branch structural design effectively enhances feature extraction capabilities.
[0104] This invention employs a gated cross-attention mechanism to achieve dynamic interactive fusion of facial and tongue image features through bidirectional attention. Different beverages have varying degrees of dependence on facial and tongue images, and the gating mechanism can dynamically adjust the direction of modality fusion based on input features, thereby improving the accuracy and adaptability of beverage recommendations. This invention uses a decoupling loss function to ensure the separation of identity features (such as skin color and facial features) from beverage preference features (such as physical requirements), thereby improving the accuracy of beverage recommendations. By constraining the consistency of identity features across different samples of the same user, while maximizing the differences in beverage features, the interference of identity differences on beverage recommendations is eliminated.
[0105] Furthermore, when dealing with beverages with high category similarity (such as juice and tea), traditional classification loss functions typically rely on a fixed temperature coefficient. This invention, however, employs a learnable temperature coefficient to dynamically adjust the logits distribution. When sample features become ambiguous, the temperature coefficient increases to soften the classification distribution, improving the model's robustness in complex situations. The proposed gradient-sensitive adaptive weight decay strategy applies different strengths of regularization to gradients in different layers. Shallow convolutional layers (such as edge detection layers) receive stronger decay to prevent overfitting to texture noise, while deeper classification layers are subject to weaker constraints to maintain their discriminative power.
[0106] As shown in Figure 8, a third objective of the embodiments of this application is to provide an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, it implements the aforementioned AI-based beverage recommendation method. The device also includes a communication interface and a bus.
[0107] The fourth objective of this application is to provide a computer-readable storage medium storing a computer program that, when executed by a processor, implements the aforementioned AI-based beverage recommendation method.
[0108] A fifth objective of this application is to provide a computer program product comprising computer instructions that instruct a computer to execute the aforementioned artificial intelligence-based beverage recommendation method.
[0109] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
[0110] These computer program instructions can also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions executed on the computer or other programmable equipment provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
[0111] This application may take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, this application may take the form of a computer program product implemented on one or more computer-usable storage media (including, but not limited to, disk storage, readable storage media, optical storage, etc.) containing computer-usable program code.
[0112] This application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of this application. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
[0113] Obviously, the described embodiments are only some, not all, of the embodiments in this application. All other embodiments obtained by those skilled in the art based on the embodiments in this application without inventive effort should fall within the scope of protection of this application.
[0114] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of this application and not to limit them. Although this application has been described in detail with reference to the above embodiments, those skilled in the art should understand that modifications or equivalent substitutions can still be made to the specific implementation methods of this application. Any modifications or equivalent substitutions that do not depart from the spirit and scope of this application should be covered within the protection scope of this application.
Claims
1. A beverage recommendation method based on artificial intelligence, characterized in that it includes the following steps: collecting a facial image and a tongue image of a user; performing cross-modal dynamic alignment processing on the facial image and the tongue image to obtain aligned cross-modal fusion features; inputting the aligned cross-modal fusion features into a dual-branch feature extraction network that includes a face branch and a tongue branch, wherein the face branch performs multi-scale global feature extraction on the cross-modal fusion features to obtain facial depth features, the tongue branch performs local texture feature extraction on the cross-modal fusion features to obtain tongue depth features, the face branch uses multi-scale dilated convolution to extract illumination-invariant features, and the tongue branch uses grouped convolution and a channel attention mechanism to focus on fine local texture features; using a gated cross-attention mechanism to dynamically interact and fuse the facial depth features and the tongue depth features to generate user fusion features; and generating and outputting the corresponding beverage recommendation results based on the user fusion features.
2. The beverage recommendation method based on artificial intelligence according to claim 1, characterized in that the cross-modal dynamic alignment processing of the facial image and the tongue image includes: extracting the convolutional features of the facial image and the gradient features of the tongue image, and performing a channel-wise cross-correlation operation on the two to obtain a first intermediate feature; extracting the convolutional features of the tongue image and the gradient features of the facial image, and performing a Hadamard product operation on the two to obtain a second intermediate feature; and dynamically generating weight coefficients based on the content of the facial image and the tongue image, and performing weighted fusion of the first intermediate feature and the second intermediate feature based on the weight coefficients to obtain the aligned cross-modal fusion features.
3. The artificial intelligence-based beverage recommendation method according to claim 2, characterized in that the dynamic generation of the weight coefficients includes generating the weight coefficients using a multilayer perceptron based on global pooling features of the facial image and the tongue image.
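The content-dependent weighting recited in claims 2 and 3 can be sketched as follows, purely for illustration. The sketch takes the two intermediate features as given, pools both inputs, and passes the pooled vector through a small perceptron to obtain two fusion weights. The two-layer shape, the softmax normalization, the function names, and every numeric value are assumptions; the claims state only that a multilayer perceptron generates the weights from global pooling features.

```python
import math


def global_avg_pool(feature):
    """feature: list of channels, each a flat list of values; returns one mean per channel."""
    return [sum(ch) / len(ch) for ch in feature]


def fusion_weights(pooled, w1, b1, w2, b2):
    """Two-layer perceptron (ReLU hidden layer) followed by a softmax over two weights.
    The softmax is an assumption so that the two weights are positive and sum to one."""
    hidden = [max(0.0, sum(w * x for w, x in zip(row, pooled)) + b) for row, b in zip(w1, b1)]
    scores = [sum(w * h for w, h in zip(row, hidden)) + b for row, b in zip(w2, b2)]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_fusion(first, second, alpha, beta):
    """Element-wise weighted sum of the two intermediate features."""
    return [[alpha * a + beta * b for a, b in zip(ca, cb)] for ca, cb in zip(first, second)]


# Toy two-channel features; the intermediate features are placeholders standing in
# for the cross-correlation and Hadamard-product results.
face = [[1.0, 3.0], [2.0, 2.0]]
tongue = [[0.0, 4.0], [1.0, 1.0]]
first_intermediate = [[0.5, 1.5], [1.0, 1.0]]
second_intermediate = [[0.0, 2.0], [2.0, 0.0]]

pooled = global_avg_pool(face) + global_avg_pool(tongue)  # concatenated pooled vector
w1 = [[0.1, 0.0, -0.1, 0.2], [0.0, 0.3, 0.1, -0.2], [0.2, -0.1, 0.0, 0.1]]
b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.5]]
b2 = [0.0, 0.0]

alpha, beta = fusion_weights(pooled, w1, b1, w2, b2)
aligned = weighted_fusion(first_intermediate, second_intermediate, alpha, beta)
```

Because the weights are recomputed from the pooled inputs, a different pair of images yields a different balance between the two intermediate features.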
4. The beverage recommendation method based on artificial intelligence according to claim 1, characterized in that: the face branch employs multi-scale dilated convolution, which performs parallel convolution on the input features at multiple different dilation rates and fuses the convolution results to extract global illumination-invariant features covering different receptive fields; the face branch extracts features with the multi-scale dilated convolution according to a formula (the formula and its symbols are not reproduced in this text) whose terms denote, in order: the dilated convolution at a given dilation rate; the dilation rates, chosen to capture local, mid-range, and global illumination-invariant features respectively; the face-branch output features at a given layer of the convolutional neural network; the cross-modally aligned input features at that layer; the batch normalization operation, used to align the distributions of the multi-scale features; and the weight parameters of the dilated convolution kernel for each dilation rate in the face branch; the tongue branch employs grouped convolution and a channel attention mechanism, extracting local texture information from the input features through grouped convolution and assigning weights to the channels of the resulting feature map through a channel attention layer so as to focus on key texture features; the tongue branch extracts features according to a formula (likewise not reproduced) whose terms denote, in order: the tongue-branch output features at a given layer of the convolutional neural network; the grouped convolution; the squeeze-and-excitation module; and the weight parameters of the grouped convolution kernel in the tongue branch.
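To make the two branches of claim 4 concrete, the following minimal sketch works on one-dimensional signals as a stand-in for two-dimensional feature maps. Several choices are assumptions, since the formulas are not available in this text: the dilation rates (1, 2, 4), fusion by summation, the `normalize` function standing in for batch normalization, the squeeze-and-excitation bottleneck shape, and all function names and weights.

```python
import math


def dilated_conv1d(signal, kernel, rate):
    """Zero-padded, same-length 1D dilated convolution (cross-correlation form)."""
    center = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = i + (j - center) * rate  # kernel taps are spaced `rate` samples apart
            if 0 <= idx < len(signal):
                acc += w * signal[idx]
        out.append(acc)
    return out


def normalize(values, eps=1e-5):
    """Zero-mean, unit-variance scaling over positions; a stand-in for batch normalization."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def face_branch(signal, kernels, rates=(1, 2, 4)):
    """Run parallel dilated convolutions, sum them (assumed fusion), then normalize."""
    branches = [dilated_conv1d(signal, k, r) for k, r in zip(kernels, rates)]
    return normalize([sum(vals) for vals in zip(*branches)])


def grouped_conv1d(channels, kernels, groups):
    """Each output channel only sees the input channels of its own group."""
    per_group = len(channels) // groups
    out = []
    for c_out in range(len(channels)):
        start = (c_out // per_group) * per_group
        acc = [0.0] * len(channels[0])
        for offset in range(per_group):
            conv = dilated_conv1d(channels[start + offset], kernels[c_out][offset], 1)
            acc = [a + v for a, v in zip(acc, conv)]
        out.append(acc)
    return out


def squeeze_excite(channels, w_down, w_up):
    """Channel attention: pool each channel, pass through a bottleneck, gate each channel."""
    squeezed = [sum(c) / len(c) for c in channels]
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w_down]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden)))) for row in w_up]
    return [[g * v for v in c] for g, c in zip(gates, channels)], gates


# Face branch on an impulse: each dilation rate spreads the response over a wider span.
impulse = [0.0] * 4 + [1.0] + [0.0] * 4
rate2_response = dilated_conv1d(impulse, [1.0, 1.0, 1.0], 2)
face_out = face_branch(impulse, kernels=[[1.0, 1.0, 1.0]] * 3)

# Tongue branch: four channels in two groups, pass-through kernels, then channel gating.
channels = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0], [2.0, 2.0, 2.0], [1.0, 0.0, 1.0]]
kernels = [[[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]] for _ in range(4)]
grouped = grouped_conv1d(channels, kernels, groups=2)
tongue_out, gates = squeeze_excite(grouped, w_down=[[0.5, 0.5, 0.5, 0.5]],
                                   w_up=[[1.0], [-1.0], [0.5], [0.0]])
```

The impulse response shows why larger dilation rates widen the receptive field without adding kernel weights, and the grouped convolution keeps the two channel groups from mixing before the gating step.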
5. The artificial intelligence-based beverage recommendation method according to claim 1, characterized in that the dynamic interactive fusion using the gated cross-attention mechanism includes: projecting the facial depth features into a query vector and the tongue depth features into a key vector and a value vector to calculate the attention weight of the face on the tongue; projecting the tongue depth features into a query vector and the facial depth features into a key vector and a value vector to calculate the attention weight of the tongue on the face; and, based on the attention weight of the face on the tongue and the attention weight of the tongue on the face, weighting and fusing the value vectors of the facial depth features and the tongue depth features to generate the user fusion features.
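The bidirectional attention of claim 5 can be sketched as below, as one possible reading rather than the claimed implementation. Scaled dot-product attention, mean pooling of the attended tokens, and a single sigmoid gate that blends the two directions are all assumptions, as are the names `cross_attention`, `gated_fusion`, `gate_w`, and `gate_b` and the toy values.

```python
import math


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query_src, kv_src, wq, wk, wv):
    """Queries come from one modality, keys and values from the other."""
    q, k, v = matmul(query_src, wq), matmul(kv_src, wk), matmul(kv_src, wv)
    scale = math.sqrt(len(q[0]))
    scores = [[sum(a * b for a, b in zip(qi, kj)) / scale for kj in k] for qi in q]
    weights = [softmax(r) for r in scores]
    return matmul(weights, v), weights


def mean_pool(tokens):
    return [sum(col) / len(tokens) for col in zip(*tokens)]


def gated_fusion(face, tongue, params):
    """Attend in both directions, then blend the two pooled results with a sigmoid gate."""
    face_ctx, _ = cross_attention(face, tongue, *params["face_to_tongue"])
    tongue_ctx, _ = cross_attention(tongue, face, *params["tongue_to_face"])
    f, t = mean_pool(face_ctx), mean_pool(tongue_ctx)
    z = sum(w * x for w, x in zip(params["gate_w"], f + t)) + params["gate_b"]
    gate = 1.0 / (1.0 + math.exp(-z))  # input-dependent, so the balance shifts per sample
    return [gate * a + (1.0 - gate) * b for a, b in zip(f, t)], gate


identity = [[1.0, 0.0], [0.0, 1.0]]  # placeholder projection matrices
params = {
    "face_to_tongue": (identity, identity, identity),
    "tongue_to_face": (identity, identity, identity),
    "gate_w": [0.5, -0.5, 0.25, 0.25],
    "gate_b": 0.0,
}
face_tokens = [[1.0, 0.0], [0.0, 1.0]]
tongue_tokens = [[1.0, 1.0], [0.5, -0.5], [0.0, 2.0]]

fused, gate = gated_fusion(face_tokens, tongue_tokens, params)
_, face_on_tongue = cross_attention(face_tokens, tongue_tokens, identity, identity, identity)
```

Each row of `face_on_tongue` is a distribution over tongue tokens, and `gate` decides how much of each attention direction reaches the fused vector.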
6. The artificial intelligence-based beverage recommendation method according to claim 1, characterized in that training the dual-branch feature extraction network includes optimizing the dual-branch feature extraction network using a decoupling loss function, which constrains the identity features of different samples of the same user to be consistent while constraining the beverage preference features of different users to be far apart; the decoupling loss function is expressed by a formula (the formula and its symbols are not reproduced in this text) whose terms denote, in order: the decoupling loss; the expectation; the term constraining identity-feature consistency; the fused features of different augmented samples of the same user; the L2 norm; the identity encoder, consisting of two fully connected layers and a LeakyReLU activation; the beverage encoder; and the features of different samples from the same user.
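Because the formula of claim 6 is not available in this text, the following is only a hedged sketch of one loss with the stated behavior: an identity-consistency term over pairs of samples from the same user, minus a weighted separation term over beverage features of pairs from different users. The subtraction form, the weight `lam`, the single-layer beverage encoder, and all names and values are assumptions. Only the identity encoder shape (two fully connected layers with LeakyReLU) follows the claim.

```python
def linear(x, w, b):
    """Fully connected layer: one output per weight row."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


def leaky_relu(x, slope=0.01):
    return [v if v > 0 else slope * v for v in x]


def identity_encoder(x, p):
    """Two fully connected layers with a LeakyReLU in between, as recited in the claim."""
    return linear(leaky_relu(linear(x, p["w1"], p["b1"])), p["w2"], p["b2"])


def sq_dist(a, b):
    """Squared L2 distance between two vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def decoupling_loss(same_user_pairs, diff_user_pairs, id_enc, bev_enc, lam=1.0):
    """Assumed form: mean identity distance for same-user pairs, minus lam times
    the mean beverage-feature distance for different-user pairs."""
    consistency = sum(sq_dist(id_enc(a), id_enc(b)) for a, b in same_user_pairs)
    consistency /= len(same_user_pairs)
    separation = sum(sq_dist(bev_enc(a), bev_enc(b)) for a, b in diff_user_pairs)
    separation /= len(diff_user_pairs)
    return consistency - lam * separation


id_params = {"w1": [[1.0, 0.0], [0.0, 1.0]], "b1": [0.0, 0.0],
             "w2": [[1.0, 1.0]], "b2": [0.0]}


def id_enc(x):
    return identity_encoder(x, id_params)


def bev_enc(x):
    # Hypothetical beverage encoder: a single pass-through linear layer.
    return linear(x, [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])


same_user_pairs = [([1.0, 2.0], [1.0, 2.0])]   # two samples of one user
diff_user_pairs = [([1.0, 0.0], [0.0, 1.0])]   # samples of two different users
loss = decoupling_loss(same_user_pairs, diff_user_pairs, id_enc, bev_enc)
```

Minimizing this quantity pulls same-user identity features together and pushes different-user beverage features apart, which is the behavior the claim describes in words.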
7. The artificial intelligence-based beverage recommendation method according to claim 1, characterized in that, when training the dual-branch feature extraction network, an orthogonal constraint loss function is used to force the feature space representing identity and the feature space representing beverage preferences to be orthogonal to each other; the orthogonal constraint loss function is expressed by a formula (the formula and its symbols are not reproduced in this text) whose terms denote, in order: the orthogonal constraint loss; the Frobenius norm, under which the identity and beverage feature spaces are forced to be orthogonal so that identity-related variables (such as facial scars) do not contaminate the beverage features; the batch identity feature matrix; the feature dimension; the number of samples in the batch; and the batch beverage feature matrix.
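One common way to write an orthogonality penalty with the ingredients listed in claim 7 is the squared Frobenius norm of the cross-product between the two batch feature matrices, normalized by batch size and feature dimension. The sketch below uses that form as an assumption; the exact formula, including the normalization and the order of the matrix product, is not available in this text, and the function name is hypothetical.

```python
def orthogonality_loss(z_id, z_bev):
    """Assumed form: ||Z_id^T Z_bev||_F^2 / (N * d), where rows are samples,
    N is the batch size and d is the identity feature dimension."""
    n, d = len(z_id), len(z_id[0])
    d_bev = len(z_bev[0])
    # cross[i][j] sums, over the batch, identity dimension i times beverage dimension j.
    cross = [[sum(z_id[s][i] * z_bev[s][j] for s in range(n)) for j in range(d_bev)]
             for i in range(d)]
    frob_sq = sum(v * v for row in cross for v in row)
    return frob_sq / (n * d)


# Batch of two samples with two-dimensional features.
z_id = [[1.0, 0.0], [-1.0, 0.0]]
z_bev_decorrelated = [[0.0, 1.0], [0.0, 1.0]]

loss_decorrelated = orthogonality_loss(z_id, z_bev_decorrelated)  # no shared variation
loss_entangled = orthogonality_loss(z_id, z_id)                   # beverage copies identity
```

The penalty vanishes when no identity dimension co-varies with any beverage dimension across the batch, and grows when the beverage features simply copy the identity features.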
8. The artificial intelligence-based beverage recommendation method according to claim 1, characterized in that, when training the dual-branch feature extraction network, a classification loss function is used to optimize the dual-branch feature extraction network; the classification loss function introduces a temperature coefficient that is dynamically generated based on the input user fusion features and is used to soften the classification probability distribution when the sample features are ambiguous; the classification loss function is expressed by a formula (the formula and its symbols are not reproduced in this text) whose terms denote, in order: the logit value of the category at a first index; the logit value of the category at a second index; the classification loss; the total number of beverage categories; the one-hot encoded label of the category at the first index; the temperature coefficient, dynamically generated based on the input features; and the exponential function with the natural constant as its base.
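The terms listed in claim 8 match the usual temperature-scaled softmax cross-entropy, in which each logit is divided by the temperature before normalization. The sketch below assumes that standard form. The temperature head (`temperature_from_features`, a softplus of a linear score with a lower bound `t_min`) is hypothetical, since the claim says only that the temperature is generated from the input features.

```python
import math


def temperature_from_features(features, w, b, t_min=0.5):
    """Hypothetical temperature head: softplus of a linear score, kept above t_min."""
    z = sum(wi * xi for wi, xi in zip(w, features)) + b
    return t_min + math.log1p(math.exp(z))


def softmax_with_temperature(logits, tau):
    """Class probabilities after dividing every logit by the temperature tau."""
    scaled = [z / tau for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def tempered_cross_entropy(logits, one_hot, tau):
    """Cross-entropy between the one-hot label and the temperature-scaled softmax."""
    scaled = [z / tau for z in logits]
    peak = max(scaled)
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return -sum(y * (s - log_norm) for y, s in zip(one_hot, scaled))


logits = [2.0, 1.0, 0.0]
label = [1.0, 0.0, 0.0]

sharp = softmax_with_temperature(logits, tau=1.0)
soft = softmax_with_temperature(logits, tau=4.0)   # higher temperature, flatter distribution
loss_at_one = tempered_cross_entropy(logits, label, tau=1.0)

tau = temperature_from_features([0.2, -0.1], w=[1.0, 1.0], b=0.0)
loss_dynamic = tempered_cross_entropy(logits, label, tau)
```

Raising the temperature flattens the predicted distribution, which is the softening effect the claim attributes to ambiguous samples.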
9. A training method for an artificial intelligence-based beverage recommendation model, characterized in that it includes: obtaining labeled facial image samples and tongue image samples, wherein the labels include corresponding beverage category labels; performing cross-modal dynamic alignment processing on the facial image samples and the tongue image samples to obtain aligned cross-modal fusion features; inputting the aligned cross-modal fusion features into a dual-branch feature extraction network that includes a face branch and a tongue branch, wherein the face branch performs multi-scale global feature extraction on the cross-modal fusion features to obtain facial depth features, and the tongue branch performs local texture feature extraction on the cross-modal fusion features to obtain tongue depth features; using a gated cross-attention mechanism to dynamically interact and fuse the facial depth features and the tongue depth features to generate user fusion features; calculating a total loss using a decoupling loss function and a classification loss function, and updating the parameters of the dual-branch feature extraction network based on the total loss, wherein the decoupling loss function constrains the identity features of different samples of the same user to be consistent while constraining the beverage preference features of different users to be far apart, and the classification loss function constrains the difference between the user fusion features and the beverage category labels; and repeating the iteration until a preset stopping condition is met to obtain the trained beverage recommendation model.
10. An artificial intelligence-based beverage recommendation system, used to execute the artificial intelligence-based beverage recommendation method according to any one of claims 1 to 8, characterized in that it includes: an image acquisition module, used to acquire a facial image and a tongue image of a user; a cross-modal alignment module, used to perform cross-modal dynamic alignment processing on the facial image and the tongue image to obtain aligned cross-modal fusion features; a dual-branch feature extraction module including a face branch and a tongue branch, wherein the face branch performs multi-scale global feature extraction on the cross-modal fusion features to obtain facial depth features, and the tongue branch performs local texture feature extraction on the cross-modal fusion features to obtain tongue depth features; a dynamic feature interaction fusion module, used to dynamically interact and fuse the facial depth features and the tongue depth features using a gated cross-attention mechanism to generate user fusion features; and a recommendation output module, used to generate and output corresponding beverage recommendation results based on the user fusion features.