Cross-domain small sample object image classification method and device for intelligent terminal
By employing a dual Riemannian manifold processing module and a lightweight student network model with topology-preserving knowledge distillation on a smart terminal, the problems of insufficient object feature representation and poor cross-domain adaptability in object recognition are solved, achieving high-precision, low-latency object recognition and adaptation to complex environments.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- QINGDAO GUOCHUANG INTELLIGENT HOME APPLIANCES RES INSTITU
- Filing Date
- 2026-04-28
- Publication Date
- 2026-06-26
AI Technical Summary
Existing technologies for object recognition in smart terminals such as smart refrigerators/freezers suffer from insufficient object feature representation capabilities, low fine-grained recognition accuracy, poor cross-domain adaptability, insufficient model lightweighting, inability to adapt to edge deployment, low utilization of unlabeled data, and high difficulty in iterative optimization.
A dual Riemannian manifold processing module is used to fuse Euclidean and hyperbolic space features. Combined with topology-preserving knowledge distillation and progressive self-training, a lightweight student network model is used to achieve high-precision object recognition on smart terminals. This model eliminates redundant modules through joint training of the feature extraction module, the dual Riemannian manifold processing module, and the classification layer, making it suitable for low-computing-power requirements at the edge.
It achieves high-precision, low-latency object recognition on smart terminals, improves recognition accuracy and generalization ability in small sample scenarios, adapts to complex environments, effectively solves the problems of uneven illumination and domain shift, and makes efficient use of unlabeled data.
Abstract
Description
Technical Field
[0001] This application relates to the field of smart home appliance technology, for example to a cross-domain small sample object image classification method and apparatus for smart terminals.
Background Technology
[0002] With the rapid development of IoT and smart home technologies, smart refrigerators / freezers, as core smart appliances in the kitchen, have seen their object recognition and management functions become a key technological direction for improving user experience. Object recognition technology for smart refrigerators / freezers is mostly based on traditional convolutional neural network models, deployed to smart refrigerators / freezers after simple fine-tuning of models pre-trained on general image datasets. However, in practical applications, this type of technology faces many bottlenecks, making it difficult to meet the actual usage needs of smart refrigerators / freezers. These bottlenecks mainly manifest in the following aspects: insufficient object feature representation capabilities and low fine-grained recognition accuracy; poor cross-domain adaptability and weak generalization ability in real-world scenarios; insufficient model lightweighting, making it unable to adapt to edge deployment requirements; and low utilization of unlabeled data, making model iteration and optimization difficult.
[0003] The related art discloses a multi-domain few-shot classification method based on knowledge distillation. It uses the teacher-student network framework of knowledge distillation for effective knowledge transfer, introduces meta-learning training strategies into knowledge distillation, and provides rich and effective knowledge to the student network through task-oriented knowledge distillation and collaboration among multiple teacher networks, ensuring that the student network adapts rapidly to few-shot tasks. By introducing multi-level knowledge distillation and extracting the output predictions and sample relationships of the teacher networks as supervisory information, it guides the training of the student network from different perspectives, thus improving the efficiency of knowledge distillation.
[0004] In the process of implementing the embodiments of this disclosure, at least the following problems were found in the related art:
[0005] While cross-domain knowledge transfer is achieved through multi-teacher networks, meta-learning, and multi-level knowledge distillation, this solution still cannot simultaneously address the precise representation of object features, the generalization ability in cross-domain small sample scenarios, the lightweight deployment requirements at the edge, and the efficient utilization of massive amounts of unlabeled data in the actual scenario of object recognition on intelligent terminals.
[0006] It should be noted that the information disclosed in the background section above is only used to enhance the understanding of the background of this application, and therefore may include information that does not constitute prior art known to those skilled in the art.
Summary of the Invention
[0007] To provide a basic understanding of some aspects of the disclosed embodiments, a brief summary is given below. This summary is not an extensive overview, nor is it intended to identify key / important components or delineate the scope of protection of these embodiments; rather, it serves as a prelude to the detailed description that follows.
[0008] This disclosure provides a method and apparatus for cross-domain small sample object image classification for smart terminals, so as to achieve high-precision, fine-grained recognition of objects in real-world scenarios of smart terminals while ensuring low computing power and low latency operation.
[0009] In some embodiments, the method includes:
[0010] The method involves acquiring an image of the object to be identified; processing the image using a feature extraction module to extract basic feature vectors; fusing the basic feature vectors using a dual Riemannian manifold processing module to generate a comprehensive feature representation; and outputting the object classification and recognition result based on the comprehensive feature representation through a classification layer. Specifically, the dual Riemannian manifold processing module uses Euclidean and hyperbolic branches to map the basic feature vectors and fuses the mapped features. The feature extraction module, dual Riemannian manifold processing module, and classification layer constitute a lightweight student network model. This lightweight student network model is obtained by topology-preserving knowledge distillation and progressive self-training followed by the removal of the teacher network.
[0011] In some embodiments, the apparatus includes a processor and a memory storing program instructions, the processor being configured to, when executing the program instructions, perform the aforementioned cross-domain few-sample object image classification method for a smart terminal.
[0012] The cross-domain small-sample object image classification method and apparatus for smart terminals provided in this disclosure can achieve the following technical effects:
[0013] By fusing Euclidean and hyperbolic space features through a dual Riemannian manifold processing module, the model accurately represents the boundaries of major object categories and fine-grained hierarchical semantics. Combined with topology-preserving knowledge distillation, it achieves efficient cross-domain knowledge transfer, significantly improving recognition accuracy and generalization ability in small-sample scenarios. Through progressive self-training and closed-loop iterative optimization, the model achieves deep adaptation to the real-world complex environments of smart terminals. The lightweight student network obtained by removing the teacher network after training is adapted to the low computing power and low storage requirements of smart terminal edge computing, thereby achieving local, low-latency, high-precision classification and recognition of object images and effectively solving practical problems such as uneven illumination and domain shift within smart terminals.
[0014] The above general description and the description below are exemplary and illustrative only and are not intended to limit this application.
Attached Figure Description
[0015] One or more embodiments are illustrated by way of example with reference to the accompanying drawings. These illustrations and drawings do not constitute a limitation on the embodiments. Elements having the same reference numerals in the drawings denote similar elements. The drawings are not necessarily drawn to scale. In the drawings:
[0016] Figure 1 is a schematic diagram of a cross-domain few-sample object image classification method for smart terminals provided in an embodiment of this disclosure;
[0017] Figure 2 is a schematic diagram of a method for obtaining a lightweight student network model provided in the embodiments of this disclosure;
[0018] Figure 3 is a schematic diagram of a method for fine-tuning an initialized joint training architecture, as provided in the embodiments of this disclosure;
[0019] Figure 4 is a schematic diagram of a stable dual Riemannian manifold processing module in the method provided in the embodiments of this disclosure;
[0020] Figure 5 is a schematic diagram illustrating the construction of class prototype centers corresponding to Euclidean space and hyperbolic space in the method provided in this embodiment of the disclosure;
[0021] Figure 6 is a schematic diagram illustrating the optimization of network parameters of the dual Riemannian manifold processing module in the method provided in this embodiment of the disclosure;
[0022] Figure 7 is a schematic diagram illustrating the construction of distillation loss in the method provided in the embodiments of this disclosure;
[0023] Figure 8 is a schematic diagram illustrating the progressive self-training of the fine-tuned joint training architecture using a target domain sample set in the method provided in this embodiment of the disclosure;
[0024] Figure 9 is a schematic diagram of another cross-domain few-sample object image classification method for smart terminals provided in this embodiment of the disclosure;
[0025] Figure 10 is a schematic diagram of a cross-domain small sample object image classification device for a smart terminal provided in an embodiment of this disclosure;
[0026] Figure 11 is a schematic diagram of a smart terminal provided in an embodiment of this disclosure.
Detailed Implementation
[0027] To provide a more detailed understanding of the features and technical content of the embodiments of this disclosure, the implementation of the embodiments of this disclosure will be described in detail below with reference to the accompanying drawings. The accompanying drawings are for illustrative purposes only and are not intended to limit the embodiments of this disclosure. In the following technical description, for ease of explanation, several details are used to provide a full understanding of the disclosed embodiments. However, one or more embodiments may still be implemented without these details. In other cases, well-known structures and devices may be simplified in their depiction to simplify the drawings.
[0028] The terms "first," "second," etc., used in the specification, claims, and accompanying drawings of this disclosure are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate for the embodiments of this disclosure described herein. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover non-exclusive inclusion.
[0029] Unless otherwise stated, the term "multiple" means two or more.
[0030] In this embodiment of the disclosure, the character " / " indicates that the objects before and after it are in an "or" relationship. For example, A / B means: A or B.
[0031] The term "and / or" describes an association between objects, indicating that three relationships can exist. For example, A and / or B means: A or B, or A and B.
[0032] The term "correspondence" can refer to an association or binding relationship. The correspondence between A and B means that there is an association or binding relationship between A and B.
[0033] In this embodiment, a smart terminal refers to a home appliance product formed by introducing microprocessors, sensor technology, and network communication technology into home appliances. It possesses characteristics of intelligent control, intelligent sensing, and intelligent applications. The operation of a smart terminal often relies on the application and processing of modern technologies such as the Internet of Things (IoT), the Internet, and electronic chips. For example, smart home appliances can be connected to electronic devices to enable users to remotely control and manage the smart terminal. Specifically, the smart terminal has a built-in edge processor, image acquisition device, and memory. A lightweight student network model is compiled by a neural network compiler and burned into the memory. When the smart terminal triggers the image acquisition device to acquire images of objects within the smart terminal, the lightweight student network model is called to perform forward inference to identify the object category and thus manage the object. Smart terminals include, but are not limited to, smart refrigerators, smart freezers, smart preservation cabinets, and other smart home appliances. Objects include food ingredients, beverages, and other non-food ingredients such as daily necessities and pharmaceuticals. Optionally, objects mainly include food ingredients and beverages.
[0034] Before the methods described below are carried out, the network environment and hyperparameters are initialized. To balance the model's representational power against the computational constraints of the edge devices on which it is later deployed, the training batch size in the examples below is set to 32, the adaptive moment estimation (Adam) optimizer is used to optimize the network weights, the initial learning rate is set to $5 \times 10^{-4}$, and a weight decay coefficient of $1 \times 10^{-4}$ is introduced. Penalizing the L2 norm of the parameter matrix suppresses overfitting of the model.
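The optimizer configuration above can be illustrated with a minimal NumPy sketch of a single Adam update in which the L2 penalty is folded into the gradient. The function name `adam_step`, the moment coefficients `beta1` and `beta2`, and `eps` are assumptions chosen for illustration (they are the commonly used Adam defaults, not values stated in this disclosure); the learning rate is passed in by the caller, and the toy problem at the end exists only to exercise the update.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr, weight_decay=1e-4,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update with an L2 penalty on the parameters.

    weight_decay follows the coefficient named in the text; beta1, beta2
    and eps are assumed defaults, not values taken from the disclosure.
    """
    # Gradient of loss + (weight_decay / 2) * ||param||^2
    g = grad + weight_decay * param
    # Exponential moving averages of the gradient and its square
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g * g
    # Bias correction for the zero-initialised moments (t starts at 1)
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

# Toy usage: minimise 0.5 * ||w - target||^2 (learning rate picked for the toy only)
target = np.array([1.0, -1.0])
w, m, v = np.zeros(2), np.zeros(2), np.zeros(2)
for t in range(1, 501):
    w, m, v = adam_step(w, w - target, m, v, t, lr=0.05)
```

Folding the penalty into the gradient corresponds to classic L2 regularization; decoupled weight decay would instead shrink the parameters directly and is a different design choice.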
[0035] As shown in Figure 1, this disclosure provides a cross-domain few-sample object image classification method for smart terminals, including:
[0036] S101, the processor acquires an image of the object to be identified.
[0037] S102, the processor uses the feature extraction module to process the image of the object to be identified and extract the basic feature vector.
[0038] S103, the processor uses the dual Riemannian manifold processing module to process and fuse the basic feature vectors to generate a comprehensive feature representation.
[0039] S104, the processor outputs object classification and recognition results through classification layer processing based on comprehensive feature representation.
[0040] Specifically, the Euclidean and hyperbolic branches of the dual Riemannian manifold processing module are used to map the basic feature vectors, and the features obtained from the mapping are fused. The feature extraction module, the dual Riemannian manifold processing module, and the classification layer constitute a lightweight student network model. The lightweight student network model is obtained by topology-preserving relation knowledge distillation and progressive self-training followed by the removal of the teacher network.
[0041] Here, the image of the object to be identified is acquired through the built-in image acquisition device of the smart terminal. This image is a real-world image of the object, i.e., food, in the context of the smart terminal's actual use, and is adaptable to complex lighting conditions, shooting angles, and other environmental features within the smart terminal. The feature extraction module in a lightweight student network model processes the object image to extract the underlying spatial-spectral fusion feature vector. This feature extraction module can be a deep convolutional neural network. The object image undergoes convolution operations, feature purification, and global average pooling operations within the deep convolutional neural network, outputting a one-dimensional basic feature vector with uniform dimensions. While extracting basic visual features such as object texture and edges, the feature extraction module also eliminates interference from ambient lighting and the variance of object position translation within the smart terminal.
[0042] The basic feature vectors are input into the dual Riemannian manifold processing module. The Euclidean space branch in the dual Riemannian manifold processing module outputs flat spatial features, i.e., Euclidean space features, through regular convolution and fully connected layers. The hyperbolic space branch in the dual Riemannian manifold processing module maps the features to a hyperbolic manifold through exponential mapping, outputting hyperbolic space features representing the hierarchical structure of food ingredients. The features output from the two branches are concatenated and semantically aligned to generate a comprehensive feature representation. By achieving collaborative representation of Euclidean and hyperbolic spaces through the dual Riemannian manifold processing module, the boundaries of major food categories are delineated using Euclidean space, while the negative curvature of hyperbolic space is used to embed the tree-like hierarchical structure of food ingredients with low distortion, maximizing the information value of limited samples and significantly improving the fine-grained food identification capability in small sample scenarios.
[0043] The comprehensive feature representation is input into the classification layer, and the Euclidean distance and hyperbolic geodesic distance between the comprehensive feature representation and the prototype center of each food category are calculated to generate a category prediction probability distribution. The category corresponding to the maximum probability is taken as the food classification and recognition result, and the recognition result is fed back to the smart terminal for object management. In this embodiment, the lightweight student network model has an inference latency of less than 100 milliseconds at the edge of the smart terminal, which can quickly complete food recognition. Moreover, in a 5-shot small sample cross-domain scenario, the food recognition accuracy is improved by more than 15% compared with existing methods, and the robustness to complex environments within the smart terminal is significantly enhanced.
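The prototype-based classification step can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the text does not say how the two distances are combined, so the sketch simply sums them, negates the sum, and applies a softmax. The names `poincare_distance` and `classify`, the curvature parameter `c` (curvature is $-c$), and the `temperature` parameter are all hypothetical.

```python
import numpy as np

def poincare_distance(x, y, c=1.0):
    """Geodesic distance on the Poincare ball of curvature -c."""
    sq = np.sum((x - y) ** 2, axis=-1)
    nx = np.sum(x * x, axis=-1)
    ny = np.sum(y * y, axis=-1)
    arg = 1.0 + 2.0 * c * sq / ((1.0 - c * nx) * (1.0 - c * ny))
    # Guard against arguments slightly below 1 caused by rounding
    return np.arccosh(np.maximum(arg, 1.0)) / np.sqrt(c)

def classify(z_e, z_h, protos_e, protos_h, c=1.0, temperature=1.0):
    """Class probabilities from distances to per-class prototype centers.

    z_e / z_h: Euclidean and hyperbolic parts of one sample's representation.
    protos_e / protos_h: (num_classes, dim) prototype centers in each space.
    Summing the two distances is an assumption made for this sketch.
    """
    d_e = np.linalg.norm(protos_e - z_e, axis=-1)
    d_h = poincare_distance(protos_h, z_h, c)
    logits = -(d_e + d_h) / temperature
    logits = logits - logits.max()          # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Toy usage with two classes
protos_e = np.array([[0.0, 0.0], [3.0, 3.0]])
protos_h = np.array([[0.1, 0.0], [-0.5, 0.4]])
probs = classify(np.array([2.8, 3.1]), np.array([-0.45, 0.35]), protos_e, protos_h)
predicted_class = int(np.argmax(probs))
```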
[0044] The lightweight student network model is achieved through a two-stage joint training process: topology-preserving knowledge distillation and progressive self-training. After training, the teacher network and related auxiliary modules are removed, and the final deployable lightweight model is obtained through quantization. Specifically, topology-preserving knowledge distillation is used to fine-tune the pre-trained and initialized teacher and student networks. The pre-training initialization of both networks is based on the source domain sample set. After pre-training, the initial weights of the teacher network are frozen as the source domain knowledge carrier, and the student network is used as the parameter optimization object. In the fine-tuning stage, labeled samples from the target domain sample set are used to fine-tune the initialized teacher and student networks. Small-sample fine-tuning based on topology-preserving knowledge distillation forces the student network to retain the source domain feature topology of the teacher network, updating only the student network parameters while the teacher network does not participate in gradient updates. This preserves the robust feature topology of the pre-trained source domain, effectively avoiding catastrophic forgetting and overfitting during small-sample cross-domain fine-tuning. Even in scenarios with complex lighting and varying shooting distances within smart terminals, high recognition accuracy is maintained.
[0045] Then, the fine-tuned teacher and student networks are self-trained using the target domain sample set to further optimize the student network. Specifically, the teacher network predicts the unlabeled samples in the target domain sample set and assigns them pseudo-labels, and the student network parameters are updated using both the pseudo-labeled samples and the labeled samples. Furthermore, the teacher network parameters are smoothly updated from the student network parameters using an exponential moving average (EMA), which filters out noisy gradients caused by erroneous pseudo-labels and improves the prediction accuracy of the teacher network. This process is iterated until the student network converges, resulting in a trained student network. The teacher network and the training-only auxiliary modules, which are redundant in the inference stage, are then removed, retaining only the feature extraction module, the dual Riemannian manifold processing module, and the classification layer. The retained network structure is quantized and compiled by a neural network compiler to obtain a lightweight student network model that can be deployed at the edge on smart terminals. Thus, through progressive self-training, the massive amounts of unlabeled data from smart terminals are used efficiently and evenly, without increasing computational costs or relying on manual annotation, to progressively improve the model's recognition accuracy and robustness in the target domain. This allows the model to retain the robust topology of the source domain while being deeply adapted to the real-world, complex scenarios of smart terminals. Ultimately, the output is a student network model that meets edge deployment requirements and enables high-precision food identification.
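The two mechanics of the self-training loop, pseudo-labelling by the teacher and the EMA update of the teacher from the student, can be sketched as below. This is a minimal illustration: the parameter dictionaries, the function names, the `decay` value, and the confidence `threshold` used to discard uncertain pseudo-labels are assumptions introduced here and are not specified in the passage above.

```python
import numpy as np

def ema_update(teacher, student, decay=0.99):
    """Move each teacher parameter a small step toward the student parameter."""
    return {k: decay * teacher[k] + (1.0 - decay) * student[k] for k in teacher}

def pseudo_label(teacher_probs, threshold=0.9):
    """Assign pseudo-labels from teacher predictions.

    teacher_probs: (num_samples, num_classes) predicted probabilities.
    Keeping only confident predictions is an assumed filtering rule.
    Returns the labels of the kept samples and a boolean mask over all samples.
    """
    confidence = teacher_probs.max(axis=1)
    labels = teacher_probs.argmax(axis=1)
    keep = confidence >= threshold
    return labels[keep], keep

# Toy usage
teacher = {"w": np.array([1.0, 1.0])}
student = {"w": np.array([0.0, 0.0])}
teacher = ema_update(teacher, student, decay=0.9)

probs = np.array([[0.95, 0.05], [0.60, 0.40], [0.02, 0.98]])
labels, keep = pseudo_label(probs)
```

Because the teacher only moves by a fraction `1 - decay` per step, a single batch of wrong pseudo-labels perturbs it far less than it perturbs the student, which is the smoothing effect the text describes.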
[0046] Furthermore, the source domain sample set refers to a collection of food image samples with complete category labeling information, collected in a standardized laboratory under controlled conditions. The target domain sample set refers to a collection of food image samples collected in the physical environment inside a user's actual smart terminal. This is the sample set of the actual application scenarios that the model ultimately needs to adapt to. This sample set generally contains a very small subset of labeled samples and a large subset of unlabeled samples. For example, the very small subset of labeled samples is set as a 5-shot small sample, representing newly added food items that the user has just put into the smart terminal and that the smart terminal has never seen before, with only 5 images for each category.
[0047] This disclosure presents a cross-domain few-sample object image classification method for smart terminals. It fuses Euclidean and hyperbolic space features through a dual Riemannian manifold processing module to accurately represent object category boundaries and fine-grained hierarchical semantics. Combined with topology-preserving relation knowledge distillation, it achieves efficient cross-domain knowledge transfer, significantly improving recognition accuracy and generalization ability in few-sample scenarios. Through progressive self-training and closed-loop iterative optimization, it achieves deep adaptation of the model to the real and complex environment of smart terminals. The lightweight student network obtained by removing the teacher network after training is adapted to the low computing power and low storage requirements of smart terminal edge computing. This enables local, low-latency, high-precision classification and recognition of object images, effectively solving practical problems such as uneven illumination and domain shift within smart terminals.
[0048] Optionally, in step S102, before the processor processes the image of the object to be identified using the feature extraction module, the method further includes preprocessing the food image by Z-score normalization.
[0049] Here, the food images are first parsed into high-dimensional tensors $X \in \mathbb{R}^{C \times H \times W}$, where $C$ is the number of color channels (three RGB channels in this embodiment), $H$ and $W$ represent the adjusted unified spatial resolution, and $\mathbb{R}$ is the set of real numbers constituting the image data. To eliminate the severe impact of complex internal environments of smart terminals, such as frost occlusion and alternating warm and cold LED light sources, on the activation values of the underlying filters in the feature extraction module, channel-level Z-score normalization is performed on the image tensor:
[0050] $\hat{x}_m = \dfrac{x_m - \mu_m}{\sigma_m}$.
[0051] Here, $\hat{x}_m$ is the standardized pixel value, $x_m$ is the original pixel value, $\mu_m$ is the mean of the m-th channel, and $\sigma_m$ is the standard deviation of the m-th channel. Then, the preprocessed batch tensor is input into the feature extraction module to extract the basic feature vector.
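The channel-level Z-score normalization can be sketched in NumPy as follows. The sketch assumes a batch laid out as (batch, channel, height, width) and assumes that the channel statistics are computed over the batch itself; the disclosure does not state where the mean and standard deviation come from, and the small `eps` added to the denominator is a numerical safeguard introduced here.

```python
import numpy as np

def zscore_per_channel(x, eps=1e-8):
    """Standardise each colour channel to zero mean and unit variance.

    x: array of shape (B, C, H, W).
    Statistics are taken over batch, height and width for each channel
    (an assumption; dataset-level statistics could be used instead).
    """
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    std = x.std(axis=(0, 2, 3), keepdims=True)
    return (x - mean) / (std + eps)

# Toy usage: a batch of four 3-channel 8x8 images with an offset and scale
rng = np.random.default_rng(0)
batch = rng.normal(loc=5.0, scale=3.0, size=(4, 3, 8, 8))
normalized = zscore_per_channel(batch)
```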
[0052] Optionally, in step S102, the processor uses the feature extraction module to process the image of the object to be identified, extracting basic feature vectors, including:
[0053] The processor purifies features through multiple cascaded residual blocks of the feature extraction module to obtain a high-dimensional abstract semantic feature map.
[0054] The processor uses a global average pooling operator to integrate and average the high-dimensional abstract semantic feature map, and outputs the underlying one-dimensional space-spectral fusion basic feature vector.
[0055] Here, the feature extraction module is a deep convolutional residual network. Feature extraction is performed through multiple cascaded residual blocks at the bottom layer of the deep convolutional residual network: each residual block sequentially performs two-dimensional convolution, batch normalization, and non-linear activation operations. The forward propagation expression for the $l$-th layer residual block is: $H^{(l)} = \sigma\left(\mathrm{BN}\left(W^{(l)} * H^{(l-1)}\right)\right) + H^{(l-1)}$, where $H^{(l-1)}$ is the input feature from layer $l-1$, $W^{(l)}$ is the weight matrix of the $l$-th two-dimensional convolution layer, $*$ denotes the convolution operation, $\sigma$ is the non-linear activation function, and $\mathrm{BN}$ represents batch normalization. Through layer-by-layer downsampling of the residual blocks, a high-dimensional abstract semantic feature map is extracted. At the end of the backbone of the deep convolutional residual network, a global average pooling operator integrates and averages the high-dimensional abstract semantic feature map of dimension $B \times d_{in} \times h \times w$ along the height and width dimensions to eliminate the translation variance caused by positional shifts of the ingredients, and outputs the underlying one-dimensional space-spectral fusion basic feature vector of dimension $d_{in}$. Here, $B$ is the batch size, i.e., the number of food images input into the model at the same time in a single batch (for example, 32), and $h$ and $w$ are the height and width of the feature map, respectively.
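The global average pooling step at the end of the backbone can be sketched as below. The function name `global_average_pool` is hypothetical. The usage lines show why pooling over height and width removes positional dependence: a circular shift of the feature map leaves the pooled vector unchanged (for real, non-circular translations the invariance is only approximate).

```python
import numpy as np

def global_average_pool(feature_map):
    """Average a (B, d_in, h, w) feature map over h and w, giving (B, d_in)."""
    return feature_map.mean(axis=(2, 3))

# Toy usage: pooled features before and after shifting the map by two columns
rng = np.random.default_rng(0)
fmap = rng.normal(size=(2, 4, 5, 5))            # B=2, d_in=4, h=w=5
pooled = global_average_pool(fmap)
pooled_shifted = global_average_pool(np.roll(fmap, shift=2, axis=3))
```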
[0056] It should be noted that the samples in the target domain and source domain are also processed and extracted as described above during the acquisition of the lightweight student network model.
[0057] Optionally, in step S103, the processor uses a dual Riemannian manifold processing module to process and fuse the basic feature vectors to generate a comprehensive feature representation, including:
[0058] In the Euclidean space branch, the processor retains or linearly transforms the basic feature vectors to obtain Euclidean space features.
[0059] In the hyperbolic space branch, the processor projects the basic feature vectors into the hyperbolic space through the Riemannian exponential map to obtain hyperbolic space features.
[0060] The processor aligns and merges the Euclidean space features and hyperbolic space features to obtain a comprehensive feature representation.
[0061] The dual Riemannian manifold processing module includes a Euclidean space branch and a hyperbolic space branch.
[0062] Here, a dual Riemannian manifold processing module is used to process and fuse the basic feature vectors to generate a comprehensive feature representation. This dual Riemannian manifold processing module contains independent but cooperative Euclidean space branches and hyperbolic space branches. Through the differentiated representation of the basic feature vectors by the two branches and subsequent fusion, the global distinguishing features of food categories are preserved, while the fine-grained hierarchical subordinate features of food ingredients are accurately captured, ultimately outputting a comprehensive feature representation with both global and fine-grained representation capabilities.
[0063] Specifically, in the Euclidean space branch, the basic feature vectors are preserved or linearly transformed. The linear transformation is a combination of convolution and fully connected layers, which can adjust the dimensions and enhance the semantics of the basic feature vectors according to the requirements of food feature representation. If the dimensions and representational capabilities of the basic feature vectors already meet the requirements for class differentiation in Euclidean space, the basic feature vectors can be directly retained as Euclidean space features. Euclidean space features possess translation invariance, clearly defining the global decision boundaries for food categories such as solid / liquid, meat / vegetables, and fresh / cooked food. Furthermore, within the Euclidean space branch, the feature distance between samples is calculated using the standard L2 norm, ensuring the separability of class features.
[0064] In the hyperbolic space branch, the basic feature vector is linearly transformed in the tangent space of the dual Riemannian manifold to obtain the transition feature vector, namely the hyperbolic tangent space vector. Then, through a learnable Riemannian exponential mapping layer, the transition feature vector in the tangent space is projected onto the hyperbolic space, transforming the basic feature vector into hyperbolic space features. Hyperbolic space features can embed the tree-like hierarchical structure of food with low distortion, for example meat → poultry → chicken, or vegetables → leafy greens → spinach.
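The two steps of the hyperbolic branch, a linear transform into the tangent space followed by projection onto the hyperbolic space, can be sketched with the standard exponential map at the origin of a Poincaré ball of curvature $-c$: $\exp_0^c(v) = \tanh(\sqrt{c}\,\lVert v \rVert)\, v / (\sqrt{c}\,\lVert v \rVert)$. Taking the origin as the base point, and the names `expmap0` and `hyperbolic_branch`, the weights `W` and `b`, and the value of `c`, are assumptions made for this illustration.

```python
import numpy as np

def expmap0(v, c=1.0, eps=1e-7):
    """Exponential map at the origin of the Poincare ball (curvature -c)."""
    sqrt_c = np.sqrt(c)
    # Clamp the norm so the zero vector maps to zero without dividing by zero
    norm = np.maximum(np.linalg.norm(v, axis=-1, keepdims=True), eps)
    return np.tanh(sqrt_c * norm) * v / (sqrt_c * norm)

def hyperbolic_branch(features, W, b, c=1.0):
    """Linear map to the tangent space, then projection onto the ball."""
    tangent = features @ W + b        # hyperbolic tangent space vector
    return expmap0(tangent, c)

# Toy usage: 4 samples, 16-dim basic features, 8-dim hyperbolic features
rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 16))
W = 0.1 * rng.normal(size=(16, 8))
b = np.zeros(8)
z_h = hyperbolic_branch(feats, W, b)
```

The `tanh` factor compresses any tangent vector into the open ball of radius $1/\sqrt{c}$, so vectors with large norm land near the boundary, where hyperbolic distances grow quickly.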
[0065] Then, the Euclidean space features and hyperbolic space features are concatenated along the channel dimension to obtain a composite representation tensor. This composite representation tensor undergoes cross-space semantic alignment through a nonlinear transformation layer with batch normalization, removing unwanted environmental interference such as global illumination and shooting angle variations of the smart terminal. The semantic information of the two types of features is then fused and enhanced, ultimately outputting a comprehensive feature representation. Thus, by utilizing parallel processing and collaborative representation of Euclidean and hyperbolic space branches, and through differentiated spatial representation and precise feature fusion, the technical problem of a single space being unable to simultaneously represent the mutually exclusive relationships between broad food categories and fine-grained hierarchical relationships is solved, significantly improving the completeness and accuracy of feature representation.
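The fusion step can be sketched as a concatenation followed by a linear layer, batch normalization, and a non-linearity. This is a simplified illustration under stated assumptions: the choice of ReLU, the use of per-batch statistics without learnable scale and shift, and the names `fuse`, `W`, and `b` are not taken from the disclosure.

```python
import numpy as np

def fuse(z_e, z_h, W, b, eps=1e-5):
    """Concatenate both feature types, then apply linear -> batch norm -> ReLU.

    z_e: (B, d_e) Euclidean features; z_h: (B, d_h) hyperbolic features.
    W: (d_e + d_h, d_out) weights; b: (d_out,) bias.
    """
    z = np.concatenate([z_e, z_h], axis=1)       # composite representation
    h = z @ W + b
    # Batch normalisation with batch statistics (training-mode behaviour)
    h = (h - h.mean(axis=0)) / np.sqrt(h.var(axis=0) + eps)
    return np.maximum(h, 0.0)                    # ReLU (assumed activation)

# Toy usage: 6 samples, 4 Euclidean dims + 3 hyperbolic dims -> 5 fused dims
rng = np.random.default_rng(0)
z_e = rng.normal(size=(6, 4))
z_h = 0.1 * rng.normal(size=(6, 3))
W = rng.normal(size=(7, 5))
b = np.zeros(5)
fused = fuse(z_e, z_h, W, b)
```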
[0066] Optionally, the hyperbolic space branch adopts a Poincaré ball model with constant negative curvature.
[0067] Here, a Poincaré ball model with constant negative curvature is used to stably construct the geometric metric of the hyperbolic space, ensuring consistent and robust hyperbolic feature mapping. The geometry of negative curvature matches the hierarchical, tree-like semantic distribution of food categories and strengthens the ability of the hyperbolic space to model hierarchical semantic features, compensating for the inability of a single Euclidean space to express hierarchical subordinate relationships. Specifically, the basic feature vectors are projected through the Riemannian exponential map onto the hyperbolic Riemannian manifold represented by the negative-curvature Poincaré ball model. Under the geometric constraints of negative curvature, stable hyperbolic space features are generated through the Riemannian exponential map and Möbius addition, and the geodesic distance between hyperbolic features is calculated under this curvature to represent their spatial distribution. Because the feature distance between samples is measured by the hyperbolic geodesic distance, fine-grained, long-tailed food items at the edge of the hierarchy obtain a larger separability gap, which accurately represents their fine-grained subordinate relationships. Keeping the curvature constant also avoids feature drift caused by fluctuations in the geometric parameters of the hyperbolic space, ensuring stable and consistent extraction of hierarchical food semantics and meeting the need for accurate identification of food subcategories in smart terminal scenarios.
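The larger separability gap near the edge of the hierarchy can be made concrete with a small sketch of the geodesic distance. It assumes the standard closed-form distance of the Poincaré ball with curvature -c; the function name and sample points are illustrative and not taken from the disclosure.

```python
import math

def poincare_distance(x, y, c=1.0):
    """Geodesic distance in the Poincare ball of curvature -c (c > 0)."""
    sq_diff = sum((a - b) ** 2 for a, b in zip(x, y))
    x2 = sum(a * a for a in x)
    y2 = sum(b * b for b in y)
    if c * x2 >= 1.0 or c * y2 >= 1.0:
        raise ValueError("points must lie strictly inside the ball")
    arg = 1.0 + 2.0 * c * sq_diff / ((1.0 - c * x2) * (1.0 - c * y2))
    return math.acosh(arg) / math.sqrt(c)

# Two pairs of points with the same Euclidean separation (0.1).
near_origin = poincare_distance([0.0, 0.0], [0.1, 0.0])
near_boundary = poincare_distance([0.8, 0.0], [0.9, 0.0])
```

In this toy example the pair close to the boundary is several times farther apart in hyperbolic distance than the pair close to the origin, although both pairs are equally far apart in Euclidean terms.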
[0068] Optionally, in the hyperbolic space branch, the processor projects the basic feature vectors onto the hyperbolic space using the Riemannian exponential map to obtain hyperbolic space features, including:
[0069] The processor performs a linear transformation on the basic feature vectors to obtain the hyperbolic tangent space vectors.
[0070] The processor projects the hyperbolic tangent space vectors onto the Poincaré ball model through the Riemannian exponential map to obtain the initial hyperbolic space features.
[0071] The processor corrects the initial hyperbolic space features algebraically using Möbius addition, thereby obtaining the final hyperbolic space features.
[0072] Here, the basic feature vectors are projected onto the hyperbolic space using the Riemannian exponential map to obtain hyperbolic space features. The feature transformation combines the Riemannian exponential map of the Poincaré ball model with Möbius addition on the hyperbolic manifold, which ensures low-distortion embedding of the basic feature vectors in the hyperbolic space and conformity with the metric axioms of hyperbolic geometry. This allows the hyperbolic space features to accurately represent the tree-like hierarchical structure of the ingredients.
[0073] Specifically, the hyperbolic space adopts a Poincaré ball model with negative curvature $-c$, where $c > 0$ is a preset fixed value that defines the geometric curvature of the hyperbolic space. Let $\mathbb{D}_c^d = \{x \in \mathbb{R}^d : c\lVert x\rVert^2 < 1\}$ denote the Poincaré ball model of dimension $d$ and curvature $-c$, where $\mathbb{R}$ is the set of real numbers. This model suits the exponential growth of the hierarchical tree structure of food ingredients. During projection, the basic feature vectors are linearly transformed into the Euclidean tangent space at the origin $x_0 = 0$ of the Poincaré ball model, yielding the transition feature vector, i.e., the tangent space vector $v$. A learnable Riemannian exponential mapping layer then applies the conformal factor to the transition feature vector and projects it onto the Poincaré ball, generating the initial hyperbolic space feature $f_{hyp}$. The Riemannian exponential map is computed as follows:
[0074] $$f_{hyp} = \exp_{0}^{c}(v) = \tanh\left(\sqrt{c}\,\lVert v\rVert\right)\frac{v}{\sqrt{c}\,\lVert v\rVert}$$
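As a minimal sketch of the projection step, the following pure-Python function implements the standard exponential map at the origin of a Poincaré ball with curvature -c. The function name and the `eps` guard are illustrative assumptions; the learnable parameters of the mapping layer are not modeled here.

```python
import math

def exp_map_origin(v, c=1.0, eps=1e-12):
    """Map a tangent vector v at the origin onto the Poincare ball (curvature -c)."""
    norm = math.sqrt(sum(a * a for a in v))
    if norm < eps:
        # The zero tangent vector maps to the origin of the ball.
        return [0.0 for _ in v]
    scale = math.tanh(math.sqrt(c) * norm) / (math.sqrt(c) * norm)
    return [scale * a for a in v]

f_hyp = exp_map_origin([3.0, 4.0], c=1.0)
f_hyp_norm = math.sqrt(sum(a * a for a in f_hyp))
```

The output keeps the direction of the tangent vector and compresses its length to less than 1/sqrt(c). For very large tangent norms, `tanh` saturates to 1.0 in floating point, so a practical implementation would typically also clip the result slightly inside the boundary.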
[0075] Within the Poincaré ball, the initial hyperbolic space features are transformed using Möbius addition to obtain the final hyperbolic space features. Because the hyperbolic space is not a vector space, conventional Euclidean linear addition cannot be applied directly to the features; Möbius addition, which is specific to hyperbolic geometry and is in general neither commutative nor associative, is used instead. For any two feature points $x$ and $y$ in the Poincaré ball model, Möbius addition is defined as follows:
[0076] $$x \oplus_c y = \frac{\left(1 + 2c\langle x, y\rangle + c\lVert y\rVert^2\right)x + \left(1 - c\lVert x\rVert^2\right)y}{1 + 2c\langle x, y\rangle + c^2\lVert x\rVert^2\lVert y\rVert^2}$$
[0077] Here, $\oplus_c$ denotes the Möbius addition operator based on the curvature constant $c$, which represents the algebraic operation rule within the hyperbolic Riemannian manifold, and $x$, $y$ denote hyperbolic feature vectors located within the Poincaré ball model, which represent the initial hyperbolic space features in this embodiment of the present disclosure.
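A short sketch of Möbius addition, assuming the standard closed form for the Poincaré ball with curvature -c, shows how it differs from ordinary vector addition. The function name and the sample points are illustrative.

```python
def mobius_add(x, y, c=1.0):
    """Mobius addition of two points inside the Poincare ball (curvature -c)."""
    xy = sum(a * b for a, b in zip(x, y))
    x2 = sum(a * a for a in x)
    y2 = sum(b * b for b in y)
    coef_x = 1.0 + 2.0 * c * xy + c * y2
    coef_y = 1.0 - c * x2
    denom = 1.0 + 2.0 * c * xy + c * c * x2 * y2
    return [(coef_x * a + coef_y * b) / denom for a, b in zip(x, y)]

x = [0.5, 0.0]
y = [0.0, 0.5]
xy_sum = mobius_add(x, y)   # x (+) y
yx_sum = mobius_add(y, x)   # y (+) x, generally a different point
```

With these inputs the two orderings give different points, which illustrates the lack of commutativity, while both results remain strictly inside the unit ball.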
[0078] Thus, by combining cross-space projection through the Riemannian exponential map with manifold optimization through Möbius addition, the basic feature vectors are transformed into hyperbolic space features. This realizes an effective mapping from Euclidean space to hyperbolic space while strictly respecting the inherent properties of hyperbolic geometry. It ensures that the hyperbolic space features are valid and well founded, and that the final hyperbolic space features embed the tree-like hierarchical relationships of the ingredients accurately and with low distortion.
[0079] With reference to Figure 2, optionally, the lightweight student network model is obtained as follows:
[0080] S201, the processor initializes the constructed joint training architecture, which includes teacher and student networks, using the source domain sample set.
[0081] S202, the processor uses momentum prototype alignment and topology preservation knowledge distillation based on metric learning to fine-tune the initialized joint training architecture to convergence using labeled samples in the target domain sample set.
[0082] S203, the processor uses the target domain sample set to progressively self-train the fine-tuned joint training architecture.
[0083] S204, after the progressive self-training converges, the processor removes the teacher network from the joint training architecture and retains only the forward propagation path of the student network to obtain the lightweight student network model.
[0084] Here, the lightweight student network model is based on a joint training architecture of teacher and student networks. After three stages of optimization—source domain sample set initialization, target domain small sample set fine-tuning (i.e., a small sample set consisting of labeled samples in the target domain), and progressive self-training of the target domain sample set—the training-dedicated teacher network and redundant training modules are removed, retaining only the simplified model adapted for deployment on the edge of smart terminals obtained from the forward propagation path of the student network. This process ensures that the model fully learns the general food knowledge of the source domain and deeply adapts to the real-world scenarios of smart terminals in the target domain, while also achieving model lightweighting, meeting the low computing power and low power consumption hardware requirements of smart terminal edge devices.
[0085] Specifically, a teacher network and a student network with identical structure are constructed. Each includes a feature extraction module, a dual Riemannian manifold processing module, and a classification layer, and together they form the joint training architecture. A model consisting of the feature extraction module, the dual Riemannian manifold processing module, and the classification layer is first pre-trained on the source domain sample set, which consists of labeled food image samples collected in a standardized laboratory environment. Pre-training on this sample set yields a pre-trained model with general knowledge: the general visual features of food, the rules for determining category boundaries, and the feature topology, which together form a general food recognition knowledge system for the source domain. The teacher network and the student network are initialized by copying the weight parameters of the pre-trained model. After initialization, the weights of the teacher network are frozen so that it serves as a fixed carrier of source domain knowledge, while the weight parameters of the student network are the optimization targets of subsequent training.
[0086] Based on metric learning, momentum prototype alignment and topology-preserving knowledge distillation are employed. Labeled samples from the target domain sample set are used to fine-tune the initialized joint training architecture with small samples until the model converges. The labeled samples in the target domain consist of a very small number of labeled food images from real-world smart terminal environments. Momentum prototype alignment based on metric learning maintains food category prototype centers in both the Euclidean and hyperbolic spaces of the dual Riemannian manifold processing module during fine-tuning, and dynamically and smoothly updates these centers using exponential moving average (EMA). This results in highly stable dual-manifold prototypes, ensuring the compactness of features for similar food categories. Subsequently, a prototype Softmax classifier based on Riemannian metric is constructed to calculate the negative exponential distribution of the distance between dual-space features and class prototypes. To avoid catastrophic forgetting and overfitting during small-sample cross-domain fine-tuning, a topology-preserving knowledge distillation loss is calculated. This loss includes a distance distillation loss to maintain feature scale consistency and an angular distillation loss to maintain semantic direction consistency. This forces the student network to adapt to the target domain smart terminal environment while preserving the feature topology of the source domain in the teacher network. By updating the student network parameters through backpropagation, the teacher network does not participate in gradient updates, thus completing the initial adaptation of the student network to the target domain smart terminal scenario.
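The prototype Softmax classifier can be sketched as a softmax over negative distances to the class prototypes. The disclosure does not state how the Euclidean and hyperbolic distances are combined, so this sketch assumes a plain sum divided by a hypothetical `temperature` parameter; all names and sample values are illustrative.

```python
import math

def euclidean_distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def poincare_distance(x, y, c=1.0):
    """Geodesic distance in the Poincare ball of curvature -c."""
    sq_diff = sum((a - b) ** 2 for a, b in zip(x, y))
    x2 = sum(a * a for a in x)
    y2 = sum(b * b for b in y)
    arg = 1.0 + 2.0 * c * sq_diff / ((1.0 - c * x2) * (1.0 - c * y2))
    return math.acosh(arg) / math.sqrt(c)

def prototype_softmax(f_euc, f_hyp, protos_euc, protos_hyp, c=1.0, temperature=1.0):
    """Class probabilities from negative dual-space distances (assumed: summed)."""
    logits = []
    for p_euc, p_hyp in zip(protos_euc, protos_hyp):
        d = euclidean_distance(f_euc, p_euc) + poincare_distance(f_hyp, p_hyp, c)
        logits.append(-d / temperature)
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, label):
    return -math.log(probs[label])

# Hypothetical two-class example.
protos_euc = [[0.0, 0.0], [2.0, 2.0]]
protos_hyp = [[0.1, 0.0], [0.6, 0.3]]
probs = prototype_softmax([0.2, -0.1], [0.15, 0.05], protos_euc, protos_hyp)
```

A sample that lies close to the class-0 prototypes in both spaces receives most of the probability mass for class 0, and its cross-entropy against label 0 is correspondingly small.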
[0087] By leveraging the target domain sample set to perform distribution-aware, progressive self-training of the fine-tuned joint training architecture, the value of unlabeled sample data in the target domain is fully exploited. During self-training, the teacher network predicts massive amounts of unlabeled samples, and a dynamic filtering strategy is used to select unlabeled samples and generate pseudo-labels, effectively identifying rare, long-tailed data. The pseudo-labeled samples are then mixed with labeled samples from the target domain and input into the student network for end-to-end closed-loop iterative optimization, continuously updating the student network parameters. The teacher network parameters are smoothly updated by the student network using an exponential moving average, filtering out noise gradients caused by erroneous pseudo-labels. This process continuously optimizes the classification boundary of the teacher network until the recognition accuracy of the student network on the target domain sample set stabilizes, indicating model convergence.
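One self-training round can be sketched as two small operations: filtering teacher predictions into pseudo-labels, and updating the teacher from the student by exponential moving average. The disclosure describes the filtering strategy only as "dynamic"; the fixed confidence `threshold` below is a simplifying assumption, as are the function names and the flat parameter lists.

```python
def select_pseudo_labels(teacher_probs, threshold=0.9):
    """Keep unlabeled samples whose top teacher probability meets the threshold.

    Returns (sample_index, pseudo_label) pairs. A fixed threshold is an
    assumption; the source only states that the filter is dynamic.
    """
    selected = []
    for index, probs in enumerate(teacher_probs):
        confidence = max(probs)
        if confidence >= threshold:
            selected.append((index, probs.index(confidence)))
    return selected

def ema_update(teacher_params, student_params, momentum=0.99):
    """Move each teacher parameter a small step towards the student parameter."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]

# Hypothetical teacher outputs for three unlabeled samples.
teacher_probs = [[0.95, 0.05], [0.60, 0.40], [0.08, 0.92]]
pseudo = select_pseudo_labels(teacher_probs, threshold=0.9)
new_teacher = ema_update([1.0, 2.0], [0.0, 4.0], momentum=0.9)
```

The low-confidence second sample is dropped, and the teacher parameters move only a fraction of the way towards the student, which is what dampens the effect of occasional wrong pseudo-labels.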
[0088] After the joint training architecture converges through progressive self-training, the trained architecture is streamlined by removing the teacher network dedicated to training. Simultaneously, all redundant training auxiliary modules in the inference stage, such as the EMA prototype momentum storage matrix, topological knowledge distillation loss calculation nodes, and pseudo-label filtering modules, are removed, retaining only the complete forward propagation path of the student network. This lightweight model retains only the structure required for food recognition forward inference, significantly reducing the model's computational overhead and memory usage.
[0089] With reference to Figure 3, optionally, in S202, the processor fine-tunes the initialized joint training architecture using labeled samples from the target domain sample set, based on metric-learning-based momentum prototype alignment and topology-preserving knowledge distillation, including:
[0090] S221, the processor updates the class prototype centers of the dual Riemannian manifold processing module based on labeled samples in the target domain through a momentum prototype alignment mechanism to obtain a stable dual Riemannian manifold processing module.
[0091] S222, the processor constructs a cross-entropy loss and a prototype contrast loss based on the prototype centers of the stabilized dual Riemannian manifold processing module.
[0092] S223, the processor constructs a distillation loss based on topology-preserving relational knowledge distillation.
[0093] S224, the processor uses the weighted sum of the cross-entropy loss, the prototype contrast loss, and the distillation loss as the total loss function. Based on the total loss function, it executes the backpropagation algorithm to calculate the joint gradient with respect to the weights of each layer of the student network in order to update the student network.
[0094] Here, a momentum prototype alignment mechanism is used to address the problem of prototype center drift in cross-domain scenarios of smart terminals. The feature discrimination ability of the student network is enhanced by combining prototype clustering loss and classification loss, and the source domain general knowledge of the student network is preserved based on topology-preserving relation knowledge distillation. This effectively avoids overfitting and catastrophic forgetting during small-sample fine-tuning, while also adapting to the domain shift problem caused by complex lighting and varying shooting distances within smart terminals, reducing dependence on labeled sample data in the target domain.
[0095] Specifically, based on labeled samples in the target domain, the Euclidean space prototype center and hyperbolic space prototype center of the dual Riemannian manifold processing module are dynamically updated using a momentum prototype alignment mechanism (i.e., exponential moving average EMA mechanism). This continues until the prototype centers stabilize (e.g., no significant fluctuations for 5-10 consecutive iterations), ensuring that the prototype centers accurately adapt to the feature distribution of objects in the target domain under complex environments of smart terminals (dim lighting, occlusion, and varying shooting distances). In this way, the dynamic updates through the momentum prototype alignment mechanism allow the prototype centers to adaptively adjust to follow the feature distribution of labeled samples in the target domain, avoiding feature matching bias caused by a fixed prototype center. Simultaneously, the smoothing effect of the exponential moving average effectively suppresses prototype drift caused by small sample batch fluctuations, ensuring the stability and accuracy of the prototype centers.
[0096] After the class prototype centers of the dual Riemannian manifold processing module stabilize, they are used as the metric reference to construct the cross-entropy loss and the prototype contrast loss, which jointly constrain the feature learning of the student network. The prototype contrast loss $L_{proto}$ is constructed from the distance between the dual-space features output by the dual Riemannian manifold processing module and the corresponding class prototype centers:
[0097] .
[0098] where $L_{proto}$ denotes the prototype contrast loss, which constrains the features of food items of the same class to cluster towards the corresponding prototype center; the Euclidean distance term is the L2 norm distance in Euclidean space between a Euclidean space feature and the corresponding class prototype center; the hyperbolic distance term is the geodesic distance in hyperbolic space between a hyperbolic space feature and the corresponding class prototype center; and $y_i$ denotes the true class label of the i-th sample.
[0099] The classification layer of the student network, which uses the predicted probabilities of a prototype Softmax classifier based on Riemann metric, constructs a cross-entropy loss with the sample's true label. Thus, the prototype clustering loss and the cross-entropy loss form a dual constraint. The prototype clustering loss ensures clustering of similar classes and separation of dissimilar classes at the feature level, reducing feature confusion caused by the domain shift of smart terminals. The cross-entropy loss optimizes classification accuracy at the prediction result level, adapting to the accurate recognition of multiple object classes in smart terminal scenarios. The two work synergistically to effectively improve the student network's ability to recognize and classify object features in complex smart terminal environments, reducing classification errors.
[0100] To preserve the general source domain knowledge carried by the teacher network and to avoid overfitting and catastrophic forgetting in the student network during few-sample fine-tuning, a distillation loss is constructed based on Relational Knowledge Distillation (RKD). Labeled samples from the target domain are input into both the teacher and student networks simultaneously. The topological relationships among the source domain features output by the teacher network and among the target domain features output by the student network are obtained, respectively. The relational knowledge distillation loss then measures the difference between the feature topologies of the student and teacher networks. This ensures that the student network retains the topological structure of the source domain features while learning the target domain features. During the fine-tuning phase, the parameters of the teacher network are frozen; the teacher serves only as topological guidance and does not participate in gradient backpropagation, so that its source domain knowledge is not disturbed by the small number of target domain samples.
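The distance and angular distillation terms can be sketched following the commonly used RKD formulation: pairwise distances normalized by their mean, cosines of angles over sample triplets, and a Huber penalty on the teacher-student difference. This is a simplified illustration on plain Euclidean vectors with distinct features, not the exact loss of the disclosure; all function names are illustrative.

```python
import math

def pairwise_distances(feats):
    n = len(feats)
    return [[math.sqrt(sum((a - b) ** 2 for a, b in zip(feats[i], feats[j])))
             for j in range(n)] for i in range(n)]

def normalized_distances(feats):
    """Pairwise distances divided by their mean over distinct pairs."""
    dist = pairwise_distances(feats)
    n = len(feats)
    off_diag = [dist[i][j] for i in range(n) for j in range(n) if i != j]
    mean = sum(off_diag) / len(off_diag)
    return [[d / mean for d in row] for row in dist]

def huber(a, b, delta=1.0):
    diff = abs(a - b)
    return 0.5 * diff * diff if diff <= delta else delta * (diff - 0.5 * delta)

def rkd_distance_loss(teacher, student):
    """Penalize differences in normalized pairwise distances."""
    t = normalized_distances(teacher)
    s = normalized_distances(student)
    n = len(teacher)
    terms = [huber(t[i][j], s[i][j]) for i in range(n) for j in range(n) if i != j]
    return sum(terms) / len(terms)

def angle_cosines(feats):
    """Cosine of the angle at vertex j formed by samples i and k (distinct triplets)."""
    n = len(feats)
    cosines = []
    for j in range(n):
        for i in range(n):
            for k in range(n):
                if len({i, j, k}) < 3:
                    continue
                u = [a - b for a, b in zip(feats[i], feats[j])]
                w = [a - b for a, b in zip(feats[k], feats[j])]
                nu = math.sqrt(sum(a * a for a in u))
                nw = math.sqrt(sum(a * a for a in w))
                cosines.append(sum(a * b for a, b in zip(u, w)) / (nu * nw))
    return cosines

def rkd_angle_loss(teacher, student):
    """Penalize differences in triplet angles."""
    t = angle_cosines(teacher)
    s = angle_cosines(student)
    return sum(huber(a, b) for a, b in zip(t, s)) / len(t)

# Hypothetical teacher features and two candidate student outputs.
teacher = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
student_same_shape = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]]
student_distorted = [[0.0, 0.0], [1.0, 0.0], [3.0, 1.0]]
```

Both terms depend only on relations between samples, so a student that reproduces the teacher's layout incurs no penalty even if its absolute coordinates differ, whereas a student that rearranges the samples is penalized.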
[0101] The cross-entropy loss, prototype contrast loss, and distillation loss are weighted and summed to construct the total loss function. Based on the total loss function, the backpropagation algorithm is executed to calculate the joint gradient of the total loss with respect to the weights of each layer in the student network. The gradient descent algorithm is used to iteratively update the weights of each layer in the student network along the reverse direction of the gradient. After each iteration, the total loss value is calculated until the total loss value is continuously lower than the preset loss threshold multiple times. At this point, the model fine-tuning is considered to have converged, completing the fine-tuning process for labeled samples in the target domain.
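The weighted total loss and the stopping rule can be sketched as follows. The weight values, the `patience` count, and the function names are hypothetical; the disclosure states only that the three losses are weighted and summed and that training stops once the total loss has stayed below a preset threshold several times in a row.

```python
def total_loss(ce, proto, distill, w_ce=1.0, w_proto=0.5, w_distill=0.5):
    """Weighted sum of the three fine-tuning losses (weights are illustrative)."""
    return w_ce * ce + w_proto * proto + w_distill * distill

def has_converged(loss_history, threshold, patience=3):
    """True when the last `patience` losses are all below the threshold."""
    if len(loss_history) < patience:
        return False
    return all(loss < threshold for loss in loss_history[-patience:])

example_total = total_loss(1.0, 2.0, 4.0)
```

Requiring several consecutive sub-threshold values, rather than a single one, keeps a momentary dip in a noisy small-sample loss curve from ending fine-tuning early.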
[0102] With reference to Figure 4, optionally, in step S221, the processor updates the class prototype centers of the dual Riemannian manifold processing module based on labeled samples in the target domain using a momentum prototype alignment mechanism to obtain a stable dual Riemannian manifold processing module, including:
[0103] S2101, the processor acquires a cross-domain food image sample set, which includes source domain food images and target domain food images.
[0104] S2102, the processor constructs corresponding class prototype centers in Euclidean space in the Euclidean space branch and hyperbolic space in the hyperbolic space branch based on the cross-domain food image sample set, and updates the prototype centers of each class.
[0105] S2103, the processor calculates gradients and backpropagates them based on the updated prototype centers and the corresponding prototype contrast losses in Euclidean and hyperbolic spaces, optimizing the network parameters of the dual Riemannian manifold processing module.
[0106] S2104, the processor iteratively optimizes the dual Riemannian manifold processing module until the loss converges, thus obtaining a stable dual Riemannian manifold processing module.
[0107] Here, a cross-domain food image sample set is collected and constructed. The sample set includes a sample set of food images from the source domain and a sample set of food images from the target domain. The food images in the source domain sample set include a small number of labeled food images collected under a standardized laboratory environment, while the food images in the target domain sample set include a large number of unlabeled food images collected in real-world scenarios on smart terminals, which fits the actual application scenario of small cross-domain samples on smart terminals.
[0108] Based on a cross-domain food image sample set, corresponding class prototype centers are constructed for each food category in both the Euclidean space (within the Euclidean space branch) and the hyperbolic space (within the hyperbolic space branch). These class prototype centers represent the mean features of the corresponding food category. To avoid prototype center drift caused by small-sample gradient updates, an exponential moving average (EMA) is used to dynamically update the prototype centers for each category. The update formula for the prototype centers in the t-th iteration is:
[0109] $$P_k^{(t)} = \gamma P_k^{(t-1)} + (1-\gamma)\frac{1}{|B_k|}\sum_{i \in B_k} f_i^{(t)}$$
[0110] where $P_k^{(t)}$ denotes the class prototype center of the k-th food category after the t-th iteration, including the Euclidean space class prototype center $P_{euc,k}$ and the hyperbolic space class prototype center $P_{hyp,k}$; $\gamma$ denotes the momentum coefficient of the exponential moving average, a preset fixed constant that controls the update weight of the historical prototype center; $B_k$ denotes the set of samples in the current training batch that belong to the k-th food category; and $f_i^{(t)}$ denotes the dual-space features extracted from the i-th sample by the dual Riemannian manifold processing module in the t-th iteration, including the Euclidean space feature $f_{euc,i}$ and the hyperbolic space feature $f_{hyp,i}$.
[0111] The network parameters are optimized using the prototype contrast loss. Based on the updated prototype centers, the Euclidean space prototype contrast loss and the hyperbolic space prototype contrast loss are calculated separately and summed to obtain the joint prototype contrast loss. The network gradient is calculated from this loss and backpropagated to the dual Riemannian manifold processing module to iteratively optimize the linear transformation parameters of the Euclidean space branch, the Riemannian exponential map parameters of the hyperbolic space branch, and the Möbius addition parameters.
[0112] The optimization process iterates over forward inference of samples, prototype center update, prototype contrast loss calculation, and gradient backpropagation until the prototype contrast loss no longer decreases and levels off. At this point, the network parameters of the dual Riemannian manifold processing module have reached their optimal state, and the trained dual Riemannian manifold processing module is obtained.
[0113] With reference to Figure 5, optionally, in step S2102, the processor constructs corresponding class prototype centers based on the cross-domain food image sample set, in the Euclidean space of the Euclidean space branch and the hyperbolic space of the hyperbolic space branch, respectively, including:
[0114] S2121, the processor initializes the prototype centers of each food category in Euclidean space and hyperbolic space.
[0115] S2122, the processor calculates the mean value of the corresponding category features based on the features of similar food samples in the current training batch.
[0116] S2123, the processor smoothly updates the corresponding class prototype center in Euclidean space based on the calculated mean and exponential moving average of the corresponding class features.
[0117] S2124, the processor uses Möbius addition and the exponential moving average to update the corresponding class prototype center in hyperbolic space.
[0118] Here, for all food categories, the class prototype centers of the Euclidean space branch and the hyperbolic space branch are initialized separately. The initial Euclidean space class prototype center $P_{euc,k}$ and hyperbolic space class prototype center $P_{hyp,k}$ of each food category can be assigned using zero-vector initialization or random initialization. Preferably, the prototype centers are initialized from the source domain sample set, so that general and stable food category and subcategory features serve as the initial prototype values.
[0119] Then, the class prototypes are updated based on the target domain sample set. For each food sample in the current training batch, the dual Riemannian manifold processing module extracts the Euclidean space feature $f_{euc,i}$ and the hyperbolic space feature $f_{hyp,i}$. Samples are grouped by food category according to their true labels, and the arithmetic mean of the Euclidean space features and of the hyperbolic space features of each category within the current batch is calculated to obtain the corresponding category feature means. Euclidean space supports standard numerical operations, so standard numerical averaging and the exponential moving average (EMA) are used to smoothly update its class prototype centers, suppressing prototype drift caused by small sample batches and ensuring update stability. Hyperbolic space is a Riemannian manifold of negative curvature, so Möbius addition and the EMA are used to update its class prototype centers, strictly following the geometric operation rules of the hyperbolic manifold to ensure the closure and accuracy of prototype updates. With this differentiated dual-space update, the Euclidean space prototype relies on linear operations for smoothness and stability, while the hyperbolic space prototype relies on Möbius addition to respect the manifold geometry. Together they ensure the accuracy and reliability of the class prototype centers in the dual Riemannian manifold processing module in cross-domain small-sample food classification scenarios.
[0120] Optionally, in step S2121, the processor initializes the prototype centers of each food category in Euclidean space and hyperbolic space, including:
[0121] The processor initializes the prototype centers of each category in Euclidean space as zero vectors or small-scale random vectors.
[0122] The processor initializes the prototype centers of each category in hyperbolic space as vectors near the origin of the Poincaré ball model, ensuring that the initial prototype centers are constrained within the hyperbolic manifold.
[0123] Here, a differentiated initialization strategy is adopted to account for the different geometric properties of Euclidean and hyperbolic spaces. Euclidean space is a linear, flat geometric space that supports standard linear algebra operations and places no strict range constraint on the initial vector. Therefore, the class prototype centers of the food categories in Euclidean space are uniformly initialized to zero vectors or to small-scale random vectors with a very small value range. This initialization is simple and efficient, quickly provides baseline values for the Euclidean space prototypes, and does not interfere with the subsequent smooth updates, ensuring stable iteration of the Euclidean space prototypes. For the hyperbolic space, the initial state of each prototype must conform to the manifold constraint of the Poincaré ball model, which prevents computational failures caused by an initial prototype lying outside the hyperbolic manifold and provides a stable foundation for the iterative updates of the class prototype centers.
[0124] The hyperbolic space branch uses a Poincaré ball model with constant negative curvature, and this manifold has strict spatial constraints. To ensure the closure and validity of operations such as Möbius addition and the Riemannian exponential map, the hyperbolic space features and class prototype centers must lie inside the Poincaré ball. Therefore, the class prototype centers of the food categories in hyperbolic space are initialized as small-scale vectors near the origin of the Poincaré ball model, such that each initial prototype center satisfies the manifold constraint $c\lVert P_{hyp,k}\rVert^2 < 1$. This keeps the initial prototype centers strictly within the valid region of the hyperbolic manifold and prevents geometric distortion in subsequent hyperbolic prototype updates and feature metric calculations caused by an out-of-bounds initial prototype, thus ensuring reliable hierarchical semantic modeling in hyperbolic space.
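A minimal sketch of the hyperbolic initialization, assuming small uniform random vectors around the origin and an explicit safety projection back into the ball. The `scale`, `margin`, and `seed` parameters and both function names are hypothetical choices for illustration.

```python
import math
import random

def project_to_ball(x, c=1.0, margin=1e-5):
    """Rescale x if needed so that its norm stays below (1 - margin) / sqrt(c)."""
    max_norm = (1.0 - margin) / math.sqrt(c)
    norm = math.sqrt(sum(a * a for a in x))
    if norm <= max_norm:
        return list(x)
    return [a * max_norm / norm for a in x]

def init_hyperbolic_prototypes(num_classes, dim, c=1.0, scale=1e-3, seed=0):
    """Draw small random vectors near the origin, one per class."""
    rng = random.Random(seed)
    protos = []
    for _ in range(num_classes):
        p = [rng.uniform(-scale, scale) for _ in range(dim)]
        protos.append(project_to_ball(p, c))
    return protos

prototypes = init_hyperbolic_prototypes(num_classes=5, dim=4, c=1.0)
clipped = project_to_ball([3.0, 4.0], c=1.0)
```

The same projection helper can be reused after each prototype update as a guard, although operations such as Möbius addition already keep valid inputs inside the ball in exact arithmetic.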
[0125] Optionally, in step S2123, the processor smoothly updates the corresponding class prototype center in Euclidean space based on the calculated mean and exponential moving average of the corresponding class features, including:
[0126] The processor presets the weighting coefficients for the exponential moving average.
[0127] The processor performs a weighted fusion of the historical prototype centers from the previous iteration and the feature mean of similar samples in the current training batch to update the prototype centers in Euclidean space for the current training batch.
[0128] Here, exponential moving averages are used to smoothly integrate the features of historical prototypes and the current batch. Combined with standard numerical averaging to stably calculate the mean of features for similar samples, this effectively suppresses prototype center drift in small-sample cross-domain scenarios and ensures the iterative stability of Euclidean space prototype centers. Specifically, a pre-set weight coefficient γ for the exponential moving average (EMA), which is a preset hyperparameter between 0 and 1, is used to control the weight ratio of historical prototype centers during the update process, balancing historical prototype information with current batch sample feature information, and avoiding drastic fluctuations in prototype centers caused by single batch sample volatility.
[0129] The mean of Euclidean space features for samples of the same food category within the current training batch is calculated using a standard numerical average. This means taking the arithmetic mean of the Euclidean space features of all samples belonging to the same food category in the current batch, thus obtaining the category feature center for the current batch. Then, the historical Euclidean space class prototype centers from the previous iteration are weighted and fused with the mean feature values of similar samples in the current batch, and a smooth update is performed using an exponential moving average method to obtain the final Euclidean space class prototype centers for the current training batch. The Euclidean space prototype update formula is as follows:
[0130] $$P_{euc,k}^{(t)} = \gamma P_{euc,k}^{(t-1)} + (1-\gamma)\frac{1}{|B_k|}\sum_{i \in B_k} f_{euc,i}^{(t)}$$
[0131] where $P_{euc,k}^{(t)}$ denotes the Euclidean space prototype center of the k-th food category after the t-th iteration, and $f_{euc,i}^{(t)}$ denotes the Euclidean space feature extracted from the i-th sample by the dual Riemannian manifold processing module in the t-th iteration. The weight of the historical prototype center is the momentum coefficient $\gamma$, and the weight of the current batch feature mean is $(1-\gamma)$. Through this weighted fusion, the Euclidean space class prototype center of the current training batch is obtained; it transitions smoothly across iterations and avoids significant shifts caused by sample distribution deviations in small target-domain batches, effectively suppressing prototype center drift in small-sample scenarios.
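The Euclidean prototype update can be sketched in a few lines: average the features of one class within the batch, then blend that mean with the previous prototype. The function names and the toy batch are illustrative; leaving the prototype unchanged when a class is absent from the batch is an assumption of this sketch.

```python
def batch_class_mean(features, labels, k):
    """Arithmetic mean of the features of class k in the batch, or None if absent."""
    members = [f for f, y in zip(features, labels) if y == k]
    if not members:
        return None
    dim = len(members[0])
    return [sum(f[d] for f in members) / len(members) for d in range(dim)]

def ema_update_euclidean(prototype, features, labels, k, gamma=0.9):
    """Blend the previous prototype with the current batch mean of class k."""
    mean = batch_class_mean(features, labels, k)
    if mean is None:
        # Assumed behavior: keep the prototype when the class is not in the batch.
        return list(prototype)
    return [gamma * p + (1.0 - gamma) * m for p, m in zip(prototype, mean)]

# Hypothetical batch with two samples of class 0 and one of class 1.
features = [[2.0, 0.0], [4.0, 2.0], [9.0, 9.0]]
labels = [0, 0, 1]
updated = ema_update_euclidean([1.0, 1.0], features, labels, k=0, gamma=0.9)
```

Here the class-0 batch mean is (3, 1), so the prototype moves only one tenth of the way from (1, 1) towards it.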
[0132] Optionally, the weighting coefficient γ of the exponential moving average is 0.9. A larger γ gives the old prototype a higher proportion, so new batches of samples have little impact and prototype updates are very stable. A smaller γ gives the old prototype a lower proportion, so new batches of samples have a significant impact and prototype updates are rapid but volatile. In this embodiment, setting γ to 0.9 lets the prototype update slowly on the basis of historically stable information and prevents it from being skewed by small batches of samples, thus balancing stability and update efficiency.
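The Euclidean prototype update described above can be illustrated with a minimal Python sketch. The function name `ema_update_prototype`, the plain-list feature representation, and the choice to keep the historical center when a class has no samples in the batch are assumptions made for illustration and are not taken from the embodiment.

```python
def ema_update_prototype(prev_center, batch_features, gamma=0.9):
    """Return gamma * prev_center + (1 - gamma) * mean(batch_features)."""
    if not batch_features:
        # Assumption: with no sample of this class in the batch,
        # the historical center is kept unchanged.
        return list(prev_center)
    n = len(batch_features)
    dim = len(prev_center)
    # Arithmetic mean of the same-class features in the current batch.
    batch_mean = [sum(f[d] for f in batch_features) / n for d in range(dim)]
    # Weighted fusion of the historical center and the batch mean.
    return [gamma * p + (1.0 - gamma) * m
            for p, m in zip(prev_center, batch_mean)]


prev_center = [1.0, 0.0]
batch_features = [[3.0, 2.0], [1.0, 4.0]]  # arithmetic mean is [2.0, 3.0]
new_center = ema_update_prototype(prev_center, batch_features, gamma=0.9)
```

With γ = 0.9 the center moves only one tenth of the way from the historical center toward the batch mean, which is the smoothing behavior the paragraph describes.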
[0133] Optionally, in S2124, the processor uses Möbius addition and an exponential moving average to update the class prototype center corresponding to the hyperbolic space in combination with the calculated mean of the corresponding class features, including:
[0134] The processor presets the weighting coefficient of the exponential moving average.
[0135] The processor uses Möbius addition to weight and fuse the historical prototype center of the previous iteration with the feature mean of the same-class samples in the current batch, so as to update the hyperbolic space prototype center of the current training batch.
[0136] Here, Möbius addition follows the geometric operation rules of hyperbolic manifolds and is combined with the exponential moving average to achieve smooth prototype updates. This keeps the update of the hyperbolic space class prototype center closed within the hyperbolic space while effectively suppressing prototype drift in small-sample cross-domain scenarios, thus improving the stability of hierarchical semantic modeling in hyperbolic space. The preset weight coefficient of the exponential moving average is as described above and is not repeated here. The mean of the hyperbolic space features of same-class food samples within the current training batch is calculated by standard arithmetic averaging to obtain the hyperbolic space class feature center of the current batch. Then, based on the geometric operation rules of hyperbolic Riemannian manifolds, Möbius addition is used to weight and fuse the historical hyperbolic space class prototype center obtained from the previous iteration with the mean hyperbolic space feature of the same-class samples in the current batch, completing the exponential moving average smooth update. Finally, the hyperbolic space class prototype center of the current training batch is obtained.
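[0137] The formula for updating the hyperbolic space class prototype center is as follows: $$c_k^{H,(t)} = \left(\gamma\, c_k^{H,(t-1)}\right) \oplus \left((1-\gamma)\,\frac{1}{|B_k^{(t)}|}\sum_{i \in B_k^{(t)}} f_i^{H,(t)}\right)$$
[0138] where ⊕ denotes Möbius addition; $c_k^{H,(t)}$ denotes the hyperbolic space prototype center of the k-th food category after the t-th iteration; $f_i^{H,(t)}$ denotes the hyperbolic space feature of the i-th sample extracted by the dual Riemannian manifold processing module in the t-th iteration; and $B_k^{(t)}$ denotes the set of samples of the k-th category in the current batch.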
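One possible reading of the hyperbolic prototype update is sketched below in Python. It assumes the unit Poincaré ball (curvature −1), ordinary scalar weighting of each term before the Möbius addition, and the hypothetical names `mobius_add` and `hyperbolic_ema_update`; the embodiment may use a different curvature or a Möbius scalar multiplication instead.

```python
import math


def mobius_add(x, y):
    """Möbius addition on the unit Poincaré ball (assumed curvature -1)."""
    xy = sum(a * b for a, b in zip(x, y))
    x2 = sum(a * a for a in x)
    y2 = sum(b * b for b in y)
    denom = 1.0 + 2.0 * xy + x2 * y2
    return [((1.0 + 2.0 * xy + y2) * a + (1.0 - x2) * b) / denom
            for a, b in zip(x, y)]


def hyperbolic_ema_update(prev_center, batch_features, gamma=0.9):
    """Fuse the historical center with the batch mean via Möbius addition."""
    n = len(batch_features)
    dim = len(prev_center)
    batch_mean = [sum(f[d] for f in batch_features) / n for d in range(dim)]
    # Assumption: the EMA weights are applied as ordinary scalar factors.
    weighted_prev = [gamma * p for p in prev_center]
    weighted_mean = [(1.0 - gamma) * m for m in batch_mean]
    return mobius_add(weighted_prev, weighted_mean)


prev_center = [0.3, 0.0]
batch_features = [[0.1, 0.2], [0.3, 0.4]]  # all points lie inside the unit ball
new_center = hyperbolic_ema_update(prev_center, batch_features, gamma=0.9)
new_norm = math.sqrt(sum(v * v for v in new_center))
```

Because Möbius addition maps two points of the ball to another point of the ball, the updated center stays inside the hyperbolic space, which is the closure property the text refers to.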
[0139] With reference to Figure 6, optionally, in step S2103, the processor calculates the gradient and backpropagates it based on the updated prototype centers and the prototype contrast losses corresponding to Euclidean space and hyperbolic space, optimizing the network parameters of the dual Riemannian manifold processing module, including:
[0140] S2131, the processor uses the L2 norm to calculate the spatial distance between the Euclidean space features of the sample and the corresponding updated class prototype center in Euclidean space.
[0141] S2132, the processor calculates the spatial distance between the hyperbolic space features of the sample and the corresponding updated class prototype center using hyperbolic geodesic distance in hyperbolic space.
[0142] S2133, the processor constructs prototype contrast loss in Euclidean space and prototype contrast loss in hyperbolic space based on the spatial distance calculated in each space, forming a joint prototype contrast loss.
[0143] S2134, the processor optimizes the network parameters of the dual Riemannian manifold processing module by minimizing the joint prototype contrast loss through backpropagation.
[0144] Here, by using a dual-space differential distance metric and joint loss constraint, features of similar food items are forced to cluster towards their corresponding class prototype centers in both spaces. This improves the compactness of intra-class features and the discriminative power between classes, further enhancing the module's ability to model the semantics of food items at the hierarchical level and adapting to cross-domain small-sample classification scenarios. Specifically, the Euclidean space class prototype centers and hyperbolic space class prototype centers, updated based on exponential moving averages, are used as benchmarks for measuring sample feature similarity, to assess the degree of matching between sample features and their respective class centers. For Euclidean space features, the L2 norm is used to calculate the spatial distance between the sample's Euclidean space features and the corresponding updated class prototype centers. The calculation formula is as follows:
[0145] $$d_E\left(f_i^E, c_{y_i}^E\right) = \left\lVert f_i^E - c_{y_i}^E \right\rVert_2$$
[0146] For hyperbolic space features, the hyperbolic geodesic distance specific to hyperbolic space is used to calculate the spatial distance between the hyperbolic space features of the sample and the corresponding updated class prototype center. The calculation formula is as follows:
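[0147] $$d_H\left(f_i^H, c_{y_i}^H\right) = 2\,\operatorname{artanh}\left(\left\lVert \left(-f_i^H\right) \oplus c_{y_i}^H \right\rVert_2\right)$$ where ⊕ denotes Möbius addition in the Poincaré ball model, taken here with unit curvature.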
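[0148] Furthermore, the hyperbolic geodesic distance between the hyperbolic space features of the sample and the corresponding updated class prototype center can also be expressed as:
[0149] $$d_H\left(f_i^H, c_{y_i}^H\right) = \operatorname{arcosh}\left(1 + \frac{2\left\lVert f_i^H - c_{y_i}^H\right\rVert_2^2}{\left(1-\left\lVert f_i^H\right\rVert_2^2\right)\left(1-\left\lVert c_{y_i}^H\right\rVert_2^2\right)}\right)$$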
[0150] Euclidean space prototype contrast loss is constructed based on Euclidean space distance, and hyperbolic space prototype contrast loss is constructed based on hyperbolic space distance. The two types of losses are directly summed to form a joint prototype contrast loss, which is the total prototype contrast loss. The formula for calculating the prototype contrast loss is:
[0151] $$L_{proto} = \sum_{i}\left[\left\lVert f_i^E - c_{y_i}^E\right\rVert_2 + d_H\left(f_i^H, c_{y_i}^H\right)\right]$$
[0152] where $L_{proto}$ represents the prototype contrast loss, used to constrain the features of same-class food items to cluster towards the corresponding prototype center; $\lVert\cdot\rVert_2$ represents the L2 norm distance in Euclidean space, used to calculate the distance between a Euclidean space feature and the corresponding class prototype center; $f_i^E$ represents the Euclidean space feature of the i-th sample; $c_{y_i}^E$ represents the Euclidean space class prototype center of the category to which the i-th sample belongs; $d_H(\cdot,\cdot)$ represents the hyperbolic geodesic distance in hyperbolic space, used to calculate the distance between a hyperbolic space feature and the corresponding class prototype center; $f_i^H$ represents the hyperbolic space feature of the i-th sample; $c_{y_i}^H$ represents the hyperbolic space class prototype center of the category to which the i-th sample belongs; and $y_i$ is the true class label of the i-th sample. The sum runs over the samples of the current training batch.
[0153] Using the joint prototype contrast loss as the optimization objective, the gradient of the loss with respect to the network parameters of the dual Riemannian manifold processing module is calculated and propagated backward along the network's forward propagation path. The network parameters of the dual Riemannian manifold processing module are iteratively updated using a gradient descent algorithm, continuously minimizing the joint prototype contrast loss until convergence. As a result, the distance between sample features and the corresponding class prototype centers continuously decreases, yielding a stable dual Riemannian manifold processing module.
[0154] Optionally, in step S2132, the processor calculates the spatial distance between the hyperbolic space features of the sample and the corresponding class prototype center using hyperbolic geodesic distance in hyperbolic space, including:
[0155] The processor uses Möbius addition to calculate the relative vector between the hyperbolic space features of the sample and the corresponding class prototype center.
[0156] The processor calculates the Euclidean norm of the relative vector and substitutes the Euclidean norm into the hyperbolic geodesic distance formula to calculate the spatial distance between the hyperbolic space features of the sample and the corresponding class prototype center.
[0157] Here, Möbius addition, which is specific to hyperbolic space, is used to calculate the relative vector Δf between the hyperbolic space features of the sample and the corresponding class prototype center: $\Delta f = \left(-f_i^H\right) \oplus c_{y_i}^H$. The standard Euclidean norm $\lVert\Delta f\rVert_2$ of the relative vector is then calculated. Substituting this norm into the predefined hyperbolic geodesic distance formula gives the spatial distance between the hyperbolic space features of the sample and the corresponding class prototype center: $d_H\left(f_i^H, c_{y_i}^H\right) = 2\,\operatorname{artanh}\left(\lVert\Delta f\rVert_2\right)$, written here for unit curvature.
[0158] In this way, the distance between sample features and class prototype centers is calculated under the constraints of hyperbolic manifold, ensuring that the measurement method conforms to the hyperbolic geometric rules and improving the modeling accuracy of food hierarchical semantics in hyperbolic space.
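The two-step distance computation (Möbius relative vector, then Euclidean norm substituted into the geodesic formula) can be sketched in Python as follows. The sketch assumes the unit Poincaré ball with curvature −1; the function names are hypothetical, and the closed-form arcosh expression of the Poincaré distance is included only as a numerical cross-check.

```python
import math


def mobius_add(x, y):
    """Möbius addition on the unit Poincaré ball (assumed curvature -1)."""
    xy = sum(a * b for a, b in zip(x, y))
    x2 = sum(a * a for a in x)
    y2 = sum(b * b for b in y)
    denom = 1.0 + 2.0 * xy + x2 * y2
    return [((1.0 + 2.0 * xy + y2) * a + (1.0 - x2) * b) / denom
            for a, b in zip(x, y)]


def hyperbolic_distance(feature, center):
    """Geodesic distance via the Möbius relative vector and its norm."""
    delta = mobius_add([-a for a in feature], center)  # relative vector
    norm = math.sqrt(sum(d * d for d in delta))        # Euclidean norm
    return 2.0 * math.atanh(norm)


def hyperbolic_distance_arcosh(feature, center):
    """Closed-form Poincaré distance, used here only as a cross-check."""
    diff2 = sum((a - b) ** 2 for a, b in zip(feature, center))
    f2 = sum(a * a for a in feature)
    c2 = sum(b * b for b in center)
    return math.acosh(1.0 + 2.0 * diff2 / ((1.0 - f2) * (1.0 - c2)))


feature = [0.1, 0.2]
center = [0.3, -0.4]
d_mobius = hyperbolic_distance(feature, center)
d_closed = hyperbolic_distance_arcosh(feature, center)
```

Both expressions agree numerically for points inside the ball, and the distance from the origin to a point of norm r reduces to 2·artanh(r).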
[0159] Optionally, iterative optimization is performed until the loss converges to obtain a stable dual Riemannian manifold processing module, including:
[0160] After each iteration completes the class prototype center update and network parameter optimization, the processor calculates the total loss value of the Euclidean space prototype contrast loss and the hyperbolic space prototype contrast loss.
[0161] When the total loss value is lower than the preset loss threshold for multiple consecutive iterations, the processor stops iterating and obtains a stable dual Riemannian manifold processing module.
[0162] Here, by continuously monitoring the convergence state of the total loss value over multiple rounds, it is possible to accurately determine whether the network parameters have reached the optimal level, avoiding underfitting of the model due to premature stopping of iteration or overfitting of the model due to excessive iteration, and ensuring that the trained dual Riemannian manifold processing module has stable recognition performance in cross-domain small sample food classification scenarios.
[0163] In each iteration, after the prototype centers in both spaces are updated and the network parameters are optimized via gradient descent, the Euclidean space prototype contrast loss and the hyperbolic space prototype contrast loss are summed to obtain the total loss value for the current iteration, i.e., the total prototype contrast loss mentioned earlier, which evaluates the current optimization level of the model. A loss threshold and the number of consecutive iterations required for convergence are preset, and the total loss value obtained in each iteration is monitored throughout training. When the total loss value stays below the preset loss threshold for the required number of consecutive iterations, the Euclidean space features and hyperbolic space features have sufficiently converged to the corresponding prototype centers and the model has reached a stable state. At this point, the iterative optimization is stopped, yielding the trained, stable dual Riemannian manifold processing module. This convergence determination identifies a suitable training stopping point, so that the feature extraction and classification performance of the dual Riemannian manifold processing module meets the needs of cross-domain small-sample food image classification on smart terminals.
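The stopping rule can be sketched as a simple check on the loss history. The names `should_stop` and `first_stopping_iteration`, and the threshold and count values in the example, are hypothetical; the embodiment only states that both are preset.

```python
def should_stop(loss_history, threshold, required_consecutive):
    """True once the latest `required_consecutive` losses are all below `threshold`."""
    if len(loss_history) < required_consecutive:
        return False
    recent = loss_history[-required_consecutive:]
    return all(loss < threshold for loss in recent)


def first_stopping_iteration(losses, threshold, required_consecutive):
    """Replay a sequence of total loss values; return the 1-based stop iteration."""
    history = []
    for step, loss in enumerate(losses, start=1):
        history.append(loss)
        if should_stop(history, threshold, required_consecutive):
            return step
    return None  # the criterion was never met


# The dip to 0.09 at iteration 3 does not count, because iteration 4 rises again.
total_losses = [0.9, 0.4, 0.09, 0.2, 0.08, 0.07, 0.06, 0.05]
stop_at = first_stopping_iteration(total_losses, threshold=0.1,
                                   required_consecutive=3)
```

Requiring several consecutive iterations below the threshold keeps a single lucky batch from ending training early.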
[0164] Optionally, in step S222, the processor constructs a cross-entropy loss based on the class prototype center of the stabilized dual Riemannian manifold processing module, including:
[0165] The processor calculates the Euclidean L2 distance and hyperbolic geodesic distance between the dual-space features of the sample and the corresponding stable class prototype center.
[0166] Using the Riemannian metric prototype Softmax classifier, the processor converts the Euclidean L2 distance and hyperbolic geodesic distance into classification prediction probabilities through a negative exponential distribution.
[0167] The processor calculates the cross-entropy based on the predicted probability and the true label of the sample, and obtains the cross-entropy loss.
[0168] Here, the dual-space features include Euclidean space features and hyperbolic space features. The Euclidean L2 distance and hyperbolic geodesic distance between the sample's dual-space features and the corresponding stable class prototype center are calculated separately to ensure that the distance metrics conform to the geometric rules of the dual Riemannian manifold. A prototype Softmax classifier based on Riemannian metrics converts the calculated Euclidean L2 distance and hyperbolic geodesic distance into classification prediction probabilities through a negative exponential distribution, strengthening the correlation between samples of the same class and their corresponding prototype centers and suppressing interference from samples of different classes. Among these, the predicted probability $p_{i,k}$ that sample $x_i$ belongs to category k is calculated from the negative exponentials of the distances between the dual-space features and the class prototypes as follows:
[0169] $$p_{i,k} = \frac{\exp\left(-\left[d_E\left(f_i^E, c_k^E\right) + d_H\left(f_i^H, c_k^H\right)\right]/T\right)}{\sum_{j=1}^{K}\exp\left(-\left[d_E\left(f_i^E, c_j^E\right) + d_H\left(f_i^H, c_j^H\right)\right]/T\right)}$$
[0170] Here, T is a temperature coefficient, a hyperparameter used to scale the distances on the dual Riemannian manifold and smooth the classification probability distribution, thereby improving the stability and generalization ability of the model under small-sample and pseudo-label training. Based on the classification prediction probability and the true class label of the sample, the cross-entropy value is calculated to obtain the cross-entropy loss $L_{CE}$:
[0171] $$L_{CE} = -\sum_{i=1}^{B}\sum_{k=1}^{K} y_{i,k}\,\log p_{i,k}, \qquad p_{i,k} = \frac{\exp\left(z_{i,k}\right)}{\sum_{j=1}^{K}\exp\left(z_{i,j}\right)}$$
[0172] where $p_{i,k}$ is the predicted probability that the i-th sample belongs to the k-th class; K is the total number of food categories; $y_{i,k}$ is the true class indicator of the i-th sample for the k-th class, with $y_{i,k}=1$ when sample i belongs to the k-th class and $y_{i,k}=0$ when it does not; $z_{i,k}$ is the raw output of the classification layer for the i-th sample and the k-th class; $z_{i,j}$ is the raw output of the classification layer for the i-th sample and the j-th class; B is the training batch size; and $\sum_{i=1}^{B}$ sums over all samples within the current training batch, so that the cross-entropy loss $L_{CE}$ is on the same order of magnitude as the prototype contrast loss mentioned above.
[0173] Thus, by combining dual spatial distance metrics with the Riemannian metric prototype Softmax classifier, the cross-entropy loss can accurately adapt to the feature output of the dual Riemannian manifold processing module. This avoids classification bias caused by a single spatial distance metric, and strengthens the correlation between similar samples and their corresponding prototype centers through negative exponential distribution transformation. This improves the optimization effect of cross-entropy loss on classification results, adapts to the complex feature distribution in cross-domain small-sample scenarios on smart terminals, and further enhances the model's classification accuracy.
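A minimal Python sketch of a distance-based prototype Softmax and the resulting cross-entropy is given below. It assumes that the Euclidean and hyperbolic distances to each prototype are already computed and are simply added before temperature scaling; the function names and example distances are hypothetical.

```python
import math


def prototype_softmax(d_euc, d_hyp, temperature=1.0):
    """Turn per-class distances into probabilities via negative exponentials."""
    # Assumption: the two distances are summed before scaling by T.
    logits = [-(de + dh) / temperature for de, dh in zip(d_euc, d_hyp)]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(batch_probs, labels):
    """Summed cross-entropy over a batch, given integer class labels."""
    return -sum(math.log(probs[y]) for probs, y in zip(batch_probs, labels))


# Distances of one sample to three class prototypes; class 0 is the closest.
d_euc = [0.2, 1.5, 2.0]
d_hyp = [0.3, 1.0, 2.5]
probs = prototype_softmax(d_euc, d_hyp, temperature=0.5)
loss = cross_entropy([probs], [0])
```

A smaller temperature sharpens the distribution toward the nearest prototype, while a larger one flattens it, which is the smoothing role attributed to T.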
[0174] With reference to Figure 7, optionally, in S223, the processor constructs a distillation loss based on topology-preserving relation knowledge distillation, including:
[0175] S2231, the processor inputs the current training batch samples into the student network to obtain the student feature set; and inputs the current training batch samples into the teacher network to obtain the teacher feature set.
[0176] S2232, the processor calculates the Euclidean spatial distance and hyperbolic geodesic distance of any sample pair based on the teacher feature set and the student feature set, respectively, to construct the distance distillation loss.
[0177] S2233, the processor calculates the cosine similarity between the unit features of any sample pair based on the teacher feature set and the student feature set, respectively, and constructs the angular distillation loss.
[0178] S2234, the processor sums the distance distillation loss and the angle distillation loss to obtain the distillation loss.
[0179] Here, through the synergistic effect of dual spatial distance distillation and angular distillation, the constructed topological constraint loss, namely distillation loss, can accurately constrain the feature topology structure of the student network to remain consistent with that of the teacher network, ensuring robust transfer of general knowledge from the source domain and avoiding overfitting and catastrophic forgetting in the cross-domain small sample fine-tuning of the student network on the smart terminal. At the same time, it adapts to the feature distribution characteristics of the dual Riemannian manifold, further improving the cross-domain generalization ability of the model and meeting the needs of object classification in the complex environment inside the smart terminal.
[0180] Specifically, labeled samples from the target domain of the current training batch are simultaneously input into both the student network and the teacher network, and the feature sets output by the two networks are obtained respectively. Among these, the student feature set $F_S$ is the set of student features $f_S$, and the teacher feature set $F_T$ is the set of teacher features $f_T$. Based on the teacher feature set and the student feature set, all sample pairs are traversed to calculate the Euclidean L2 distance in Euclidean space and the hyperbolic geodesic distance in hyperbolic space, and the distance distillation loss $L_{R\text{-}dist}$ is constructed by combining the two types of distances. Simultaneously, based on the teacher and student feature sets, the dual-space features of each sample are normalized, the cosine similarity between unit features is calculated, and the angular distillation loss $L_{R\text{-}angle}$ is constructed based on the similarity difference. Summing the distance distillation loss and the angular distillation loss yields the topological constraint loss: $L_{RKD} = L_{R\text{-}dist} + L_{R\text{-}angle}$.
[0181] Optionally, in step S2232, the processor calculates the Euclidean spatial distance and hyperbolic geodesic distance for any sample pair based on the teacher feature set and the student feature set, respectively, to construct the distance distillation loss, including:
[0182] The processor calculates the Euclidean distance and hyperbolic geodesic distance between any pair of samples in the teacher feature set and the student feature set, respectively.
[0183] The processor normalizes the calculated Euclidean space distances and hyperbolic geodesic distances by their respective batch means to obtain the normalized Euclidean space relative distance and hyperbolic geodesic relative distance.
[0184] The processor uses the Huber loss function to constrain the consistency of the distribution of normalized Euclidean space relative distance and hyperbolic geodesic relative distance, and constructs distance distillation loss.
[0185] Here, based on the dual geometric metric rules of Euclidean and hyperbolic spaces, a three-step process—bi-spatial distance calculation, mean normalization, and Huber loss joint constraint—precisely quantifies the topological deviation of the teacher-student network in the relative distance of bi-spatial features. This topological deviation only constrains the relative scale relationship of the student network's bi-spatial features to remain consistent with that of the teacher network, ensuring that the student network fully retains the bi-spatial feature scale topology learned during pre-training in the source domain when adapting to the target domain's intelligent terminal environment.
[0186] In detail, the Euclidean spatial distance and hyperbolic geodesic distance between the features of any sample pair (i, j) are calculated in the teacher feature set, and likewise for any sample pair (i, j) in the student feature set. The standard L2 norm is used to calculate the Euclidean spatial distance between the sample pair features. Based on the Poincaré ball model, Möbius algebra, and the hyperbolic geodesic distance formula, the shortest path distance between the sample pair features along the hyperbolic manifold is calculated. Wherein, the Euclidean distance of any sample pair (i, j) is $d_{ij}^{euc} = \left\lVert f_i^E - f_j^E\right\rVert_2$, and the hyperbolic geodesic distance of any sample pair (i, j) is $d_{ij}^{hyp} = d_H\left(f_i^H, f_j^H\right)$.
[0187] The batch means of the Euclidean distances, $\mu_{T,euc}$ and $\mu_{S,euc}$, and the batch means of the hyperbolic geodesic distances, $\mu_{T,hyp}$ and $\mu_{S,hyp}$, are calculated for the teacher and student networks respectively. The dual-space distances are normalized by these means to eliminate global scale differences between the teacher and student networks, yielding normalized Euclidean spatial relative distances and normalized hyperbolic geodesic relative distances. Taking the teacher network as an example, the formulas for calculating the corresponding normalized Euclidean spatial relative distance and normalized hyperbolic geodesic relative distance are as follows:
[0188] $$\hat d_{ij}^{T,euc} = \frac{d_{ij}^{T,euc}}{\mu_{T,euc}}, \qquad \hat d_{ij}^{T,hyp} = \frac{d_{ij}^{T,hyp}}{\mu_{T,hyp}}$$
[0189] where $d_{ij}^{T,euc}$ is the Euclidean distance of the teacher network sample pair (i, j) and $d_{ij}^{T,hyp}$ is the hyperbolic geodesic distance of the teacher network sample pair (i, j); $\hat d_{ij}^{T,euc}$ and $\hat d_{ij}^{T,hyp}$ are the normalized Euclidean distance and the normalized hyperbolic geodesic distance, respectively. After normalization, the relative distances of the teacher and student networks are on the same scale, ensuring that the calculation of the topological deviation is not distorted.
[0190] The Huber loss function is used to constrain the consistency, between the student and teacher networks, of the distributions of the normalized Euclidean space relative distances and hyperbolic geodesic relative distances, thus constructing the distance distillation loss $L_{R\text{-}dist}$:
[0191] $$L_{R\text{-}dist} = \sum_{(i,j)}\left[\, l_\delta\!\left(\frac{d_{ij}^{S,euc}}{\mu_{S,euc}},\ \frac{d_{ij}^{T,euc}}{\mu_{T,euc}}\right) + l_\delta\!\left(\frac{d_{ij}^{S,hyp}}{\mu_{S,hyp}},\ \frac{d_{ij}^{T,hyp}}{\mu_{T,hyp}}\right)\right]$$
[0192] where $\mu_S$ and $\mu_T$, with the corresponding space subscript, are the batch mean feature distances of the student network and the teacher network, respectively; $l_\delta$ is the Huber loss function and δ is its hyperparameter. This loss forces the student network to maintain a high degree of consistency with the teacher network in the relative distance distribution of features in both Euclidean and hyperbolic spaces, achieving synchronous inheritance of topology across both spatial scales. Because the Huber loss is quadratic for small deviations and linear for large ones, it supports fine-grained optimization of small distance deviations while avoiding loss oscillations caused by outliers.
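The three steps (pairwise distances, mean normalization, Huber constraint) can be sketched in Python for a single space. The function names, the summation over pairs, and the default δ = 1.0 are assumptions; the example uses the Euclidean distance, and the hyperbolic branch would pass a geodesic distance function in place of `euclidean`.

```python
import math
from itertools import combinations


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pairwise_distances(features, dist):
    """Distances of all sample pairs (i, j) with i < j."""
    return [dist(features[i], features[j])
            for i, j in combinations(range(len(features)), 2)]


def mean_normalize(distances):
    """Divide by the batch mean (requires at least two distinct features)."""
    mu = sum(distances) / len(distances)
    return [d / mu for d in distances]


def huber(a, b, delta=1.0):
    err = abs(a - b)
    if err <= delta:
        return 0.5 * err * err           # quadratic near zero
    return delta * (err - 0.5 * delta)   # linear for large deviations


def distance_distillation_loss(teacher_feats, student_feats, dist, delta=1.0):
    t = mean_normalize(pairwise_distances(teacher_feats, dist))
    s = mean_normalize(pairwise_distances(student_feats, dist))
    return sum(huber(a, b, delta) for a, b in zip(s, t))


teacher = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
student_scaled = [[2.0 * v for v in f] for f in teacher]  # same shape, twice the scale
student_warped = [[0.0, 0.0], [1.0, 0.0], [0.0, 0.5]]     # relative distances changed
loss_scaled = distance_distillation_loss(teacher, student_scaled, euclidean)
loss_warped = distance_distillation_loss(teacher, student_warped, euclidean)
```

The uniformly scaled student incurs essentially zero loss, while the warped student is penalized, showing that only relative distances are constrained.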
[0193] Optionally, in step S2233, the processor calculates the cosine similarity between the unit features of any sample pair based on the teacher feature set and the student feature set, respectively, and constructs an angular distillation loss, including:
[0194] The processor normalizes the dual-space features in the teacher feature set and the student feature set to obtain unit dual-space features.
[0195] The processor calculates the cosine similarity between the unit dual-space features of any sample pair in the teacher feature set and in the student feature set, respectively.
[0196] The processor uses the mean squared error loss function to constrain the distribution consistency of cosine similarity and constructs an angular distillation loss.
[0197] Here, by normalizing the dual-space features, the angular measurement bias caused by different feature scales is eliminated, ensuring the accuracy of cosine similarity calculation. The mean squared error loss function can accurately constrain the distribution consistency of the cosine similarity of teacher and student features, strengthen the topological constraint of samples on relative angles, and adapt to the feature distribution characteristics of the bi-Riemannian manifold. At the same time, it suppresses the angular bias caused by the fluctuation of sample features in the complex environment of smart terminals.
[0198] In detail, the dual-space features in the student and teacher feature sets are each normalized to obtain unit features. The unit dual-space feature $e_T$ corresponding to a teacher feature $f_T$ and the unit dual-space feature $e_S$ corresponding to a student feature $f_S$ are calculated as: $e_T = f_T / \lVert f_T\rVert_2$, $e_S = f_S / \lVert f_S\rVert_2$. The normalized feature vector has a magnitude of 1 and represents only the semantic direction of the feature in the dual Riemannian manifold space. Based on the unit dual-space features, the teacher unit feature set $E_T$ and the student unit feature set $E_S$ are obtained. All unit features in both sets are traversed, and the cosine similarity between any pair of samples within each set is calculated, fully quantifying the relative semantic direction relationship between features. The cosine similarity $s_{ij}^T$ of a teacher sample pair and the cosine similarity $s_{ij}^S$ of a student sample pair are calculated as: $s_{ij}^T = \langle e_{T,i}, e_{T,j}\rangle$, $s_{ij}^S = \langle e_{S,i}, e_{S,j}\rangle$. Traversing all sample pairs (i, j) yields the cosine similarity set of sample pairs for the teacher network and the cosine similarity set of sample pairs for the student network.
[0199] The mean squared error (MSE) loss function is used to constrain the cosine similarities of sample pairs in the teacher and student networks. By quantifying the deviation between the two and accumulating it over all sample pairs, the angular distillation loss $L_{R\text{-}angle}$ is constructed. This forces the cosine similarity distribution of the student network to be consistent with that of the teacher network, so that the topology of the relative semantic directions of features is inherited without distortion. The mean squared error loss function quantifies the deviation of continuous values, suits the numerical distribution of cosine similarity, and has low computational complexity, so it does not increase the computational cost of model training and meets the lightweight training needs of smart edge devices. The formula for calculating the angular distillation loss is:
[0200] $$L_{R\text{-}angle} = \frac{1}{|P|}\sum_{(i,j)\in P}\left(s_{ij}^S - s_{ij}^T\right)^2$$ where $s_{ij}^S$ and $s_{ij}^T$ are the cosine similarities of sample pair (i, j) in the student network and the teacher network, respectively, and P is the set of all sample pairs in the current batch.
[0201] The magnitude of this loss value is positively correlated with the topological deviation between the student network and the teacher network in terms of the relative angle of features. The smaller the loss value, the more complete the topological structure of the source domain feature semantic direction preserved by the student network. Thus, by using angular distillation loss to constrain the student network from the dimension of relative semantic direction of features, it ensures that the relative angle between any two food features in the student network remains consistent with that of the teacher network. This prevents the student network from disrupting the semantic association topology of food features learned during pre-training in the source domain when fine-tuning with a very small number of labeled samples from the target domain. The cosine similarity of hyperbolic features is calculated as the Euclidean cosine similarity in the tangent space at the origin. Optionally, when calculating the distillation loss, only the distance distillation loss and angular distillation loss of Euclidean space features can be calculated; that is, the hyperbolic space branch does not participate in the topology-preserving relation knowledge distillation. The hyperbolic space branch can directly constrain the hyperbolic features of the student network to converge towards the teacher prototype through prototype contrast loss, achieving stable replication of the hierarchical structure. This further reduces the model training / inference complexity, and the resulting model is better suited for lightweight deployment on smart edge devices.
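The angular constraint can be sketched in Python as follows. The function names are hypothetical, the features are treated as ordinary non-zero vectors (for hyperbolic features this corresponds to the tangent-space treatment mentioned above), and averaging over sample pairs is an assumption about how the squared errors are aggregated.

```python
import math
from itertools import combinations


def unit(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def pairwise_cosine(features):
    """Cosine similarity of all sample pairs (i, j) with i < j."""
    units = [unit(f) for f in features]
    return [sum(a * b for a, b in zip(units[i], units[j]))
            for i, j in combinations(range(len(units)), 2)]


def angle_distillation_loss(teacher_feats, student_feats):
    """Mean squared error between student and teacher pairwise cosines."""
    t = pairwise_cosine(teacher_feats)
    s = pairwise_cosine(student_feats)
    return sum((a - b) ** 2 for a, b in zip(s, t)) / len(t)


teacher = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
# The teacher features rotated by 90 degrees: all relative angles are preserved.
student_rotated = [[0.0, 1.0], [-1.0, 1.0], [-2.0, 0.0]]
# Features squeezed into nearly the same direction: relative angles are lost.
student_collapsed = [[1.0, 0.0], [1.0, 0.1], [1.0, 0.2]]
loss_rotated = angle_distillation_loss(teacher, student_rotated)
loss_collapsed = angle_distillation_loss(teacher, student_collapsed)
```

A globally rotated student is not penalized because only the relative angles between sample pairs enter the loss, whereas a student whose features collapse into one direction is.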
[0202] With reference to Figure 8, optionally, in step S203, the processor performs progressive self-training on the fine-tuned joint training architecture using the target domain sample set, including:
[0203] S231, the processor uses the teacher network to infer unlabeled samples in the target domain sample set in each training iteration to generate candidate pseudo-labels.
[0204] S232, the processor uses a dynamic confidence threshold screening strategy to verify candidate pseudo-labels to obtain highly reliable pseudo-labels; the unlabeled samples with highly reliable pseudo-labels are mixed with labeled samples to form a mixed training set.
[0205] S233, the processor inputs samples from the mixed training set into the fine-tuned joint training architecture for training, updates the student network parameters and smooths the teacher network parameters using an exponential moving average until the updated joint training architecture converges on the target domain sample set.
[0206] Here, after the small-sample fine-tuning converges, to fully exploit the value of environmental data, a closed-loop progressive self-training is performed on the fine-tuned joint training architecture based on a massive amount of unlabeled samples in the target domain. Highly reliable pseudo-labels are generated through dynamic confidence thresholding, and a hybrid training set is constructed for closed-loop iterative optimization. The teacher network parameters are then smoothly updated using exponential moving average (EMA) to avoid pseudo-label noise interference and model overfitting until the model accuracy converges.
[0207] In detail, in each training iteration, a massive number of unlabeled food images from the target domain sample set are input into a teacher network with frozen weights for forward inference. The teacher network, based on a dual Riemannian manifold processing module and a Riemannian metric Softmax classifier, outputs the predicted probability distribution of the unlabeled samples' categories. The category corresponding to the highest probability can be selected as the candidate pseudo-label for that sample. Furthermore, a dynamic confidence threshold filtering strategy is used to verify the candidate pseudo-labels, eliminating those with low confidence or prone to introducing errors, and retaining only highly reliable candidate pseudo-labels. For example, a dynamic filtering rule based on the lowest confidence threshold and / or Top-K ranking within the category can be used. The unlabeled samples with highly reliable pseudo-labels after filtering are concatenated and fused with the labeled samples from the target domain sample set to form a hybrid training set, which serves as the training data for this iteration. The hybrid training set balances a small number of accurately labeled samples with a massive number of highly reliable pseudo-label samples, significantly expanding the scale of the target domain training data without increasing manual annotation costs, thus adapting to the learning scenarios of small samples across domains on smart terminals. The mixed training set is re-input into the fine-tuned joint training architecture, and forward propagation and total loss calculation are performed. Only the student network parameters are updated based on the backpropagation algorithm; the teacher network does not participate in gradient backpropagation and is smoothly updated by the student network parameters using exponential moving average (EMA). The formula for the exponential moving average in this stage is:
[0208] $$\theta_T^{(t)} = \alpha\,\theta_T^{(t-1)} + (1-\alpha)\,\theta_S^{(t)}$$
[0209] where $\theta_T^{(t)}$ and $\theta_T^{(t-1)}$ are the teacher network parameters for round t and round t−1, respectively; $\theta_S^{(t)}$ is the student network parameters for round t; and α is the momentum coefficient. This update effectively filters out noisy gradients caused by erroneous pseudo-labels, ensuring stable updates to the teacher network parameters. The above iterative process is repeated to continuously optimize the joint training architecture until the recognition accuracy curve of the updated joint training architecture on the target domain sample set becomes stable without significant fluctuations. At this point, the model is considered converged, and progressive self-training is complete.
[0210] In this way, through closed-loop progressive self-training, without destroying the topological structure of the source domain features or increasing additional computational overhead, the domain offset features of the real-world scenarios of smart terminals are deeply adapted, thereby achieving continuous improvement in model recognition accuracy and robustness.
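The smoothing effect of the teacher update can be seen in a small Python sketch. Parameters are represented as a flat dictionary of floats, and the function name and the momentum value 0.9 are chosen only for illustration; the embodiment does not state the value of the momentum coefficient here.

```python
def ema_update_teacher(teacher_params, student_params, alpha=0.99):
    """Return alpha * teacher + (1 - alpha) * student for every parameter."""
    return {name: alpha * teacher_params[name]
                  + (1.0 - alpha) * student_params[name]
            for name in teacher_params}


teacher = {"w": 1.0, "b": 0.0}
student = {"w": 0.0, "b": 1.0}

# A single update moves the teacher 10% of the way toward the student.
one_step = ema_update_teacher(teacher, student, alpha=0.9)

# With the student held fixed, n updates leave a fraction alpha**n of the gap.
tracked = dict(teacher)
for _ in range(10):
    tracked = ema_update_teacher(tracked, student, alpha=0.9)
```

Since each step closes only a fraction (1 − α) of the gap, a noisy student update caused by a wrong pseudo-label changes the teacher only slightly.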
[0211] Optionally, in step S231, the processor uses the teacher network to infer unlabeled samples in the target domain sample set to generate candidate pseudo-labels, including:
[0212] The processor uses the teacher network to infer the unlabeled samples in the target domain sample set to obtain the predicted probability distribution.
[0213] The processor extracts the maximum probability and its corresponding category from the classification prediction probability distribution, and uses the category corresponding to the maximum probability as the candidate pseudo-label for the unlabeled sample.
[0214] Here, unlabeled food images from the target domain sample set are input into the teacher network for forward inference. Each unlabeled sample is processed in sequence by the teacher network's feature extraction module and dual Riemannian manifold processing module to complete feature representation and fusion. Finally, the Riemannian metric Softmax classifier outputs the predicted probability distribution over all food categories for that sample. Let the unlabeled sample be $x_i$ and the total number of food categories be K; the predicted probability distribution output by the teacher network is then $P(x_i) = [p_{i,1}, p_{i,2}, \ldots, p_{i,K}]$, where $p_{i,k}$ is the probability that the unlabeled sample $x_i$ is predicted as the k-th food category. The predicted probability distribution output by the teacher network is traversed, and the maximum probability value and its corresponding food category are extracted. The food category corresponding to the maximum probability is used as the candidate pseudo-label for the current unlabeled sample. The maximum probability is calculated as $p_i = \max_{1 \le k \le K} p_{i,k}$.
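The extraction of the candidate pseudo-label described above can be illustrated with a minimal sketch. The function name `candidate_pseudo_label` and the example probabilities are hypothetical and are not taken from this disclosure; the sketch only assumes that the teacher network has already produced a probability list for one sample.

```python
from typing import List, Tuple


def candidate_pseudo_label(probs: List[float]) -> Tuple[int, float]:
    """Return (category index, maximum probability) for one sample.

    `probs` is assumed to be the teacher network's predicted probability
    distribution over the K food categories for a single unlabeled sample.
    """
    if not probs:
        raise ValueError("probability distribution must not be empty")
    # argmax: index of the category with the largest predicted probability
    best_k = max(range(len(probs)), key=lambda k: probs[k])
    return best_k, probs[best_k]


# Hypothetical teacher output for one unlabeled sample with K = 4 categories.
teacher_probs = [0.05, 0.80, 0.10, 0.05]
label, p_i = candidate_pseudo_label(teacher_probs)
```

Here `label` is the candidate pseudo-label and `p_i` is the confidence value that the later threshold screening compares against.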
[0215] Optionally, in S232, the processor verifies candidate pseudo-labels based on a dynamic confidence threshold filtering strategy, including:
[0216] For any unlabeled sample, the processor obtains the maximum probability value $p_i$ in the corresponding predicted probability distribution.
[0217] When $p_i \ge \tau_t$, the processor retains the candidate pseudo-label for the sample.
[0218] Where $\tau_t$ is the global minimum confidence threshold, which increases non-linearly with the number of iterations and approaches a high confidence level as training progresses; t represents the current training round.
[0219] Here, for each unlabeled sample whose candidate pseudo-label is obtained through teacher network inference, the maximum probability value $p_i$ is extracted from its corresponding predicted probability distribution. A global minimum confidence threshold $\tau_t$ is set that increases non-linearly with the number of iterations, where t represents the current training round. In the initial self-training phase, the student network has limited adaptability to the target domain features, and the threshold $\tau_t$ is kept at a relatively low level so that sufficient reliable samples are selected for training. As the number of training rounds t increases, the student network's learning of the target domain's food characteristics deepens and its prediction accuracy continuously improves; the threshold $\tau_t$ gradually increases non-linearly and eventually approaches a preset high confidence level. As an example, the initial value of $\tau_t$ is 0.6.
[0220] The maximum predicted probability value $p_i$ of the unlabeled sample is compared with the dynamic threshold $\tau_t$ of the current round. If $p_i \ge \tau_t$, the candidate pseudo-label is deemed to have sufficient confidence and is retained. Conversely, if $p_i < \tau_t$, the candidate pseudo-label is deemed to have insufficient confidence and a high risk of error, and both the candidate pseudo-label and its corresponding sample are discarded. In this way, the dynamic threshold verification strategy gradually raises the pseudo-label admission standard as training progresses. This avoids both a shortage of usable samples caused by an excessively high threshold in the early stages of training and the introduction of low-confidence erroneous labels in the later stages, thus preventing confirmation bias and providing continuous and reliable label data for progressive self-training.
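The text specifies only that the threshold starts low (0.6 in the example), rises non-linearly, and approaches a preset high confidence level; it does not give the schedule itself. The following sketch therefore assumes one possible schedule, an exponential saturation curve. The function names, the ceiling `tau_max = 0.95`, and the time constant `scale` are all hypothetical choices made for illustration.

```python
import math


def dynamic_threshold(t: int, tau_0: float = 0.6, tau_max: float = 0.95,
                      scale: float = 10.0) -> float:
    """Assumed schedule: starts at tau_0 and saturates towards tau_max.

    Only tau_0 = 0.6 comes from the example in the text; the exponential
    form, tau_max and scale are illustrative assumptions.
    """
    return tau_max - (tau_max - tau_0) * math.exp(-t / scale)


def keep_candidate(p_i: float, t: int) -> bool:
    """Retain the candidate pseudo-label only if p_i reaches the threshold."""
    return p_i >= dynamic_threshold(t)


# The same confidence of 0.7 is accepted early in training but rejected later.
early = keep_candidate(0.7, t=0)
late = keep_candidate(0.7, t=50)
```

Any monotonically increasing, saturating curve would fit the description equally well; the exponential form is used here only because it needs a single time constant.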
[0221] Optionally, in S232, the processor verifies candidate pseudo-labels based on a dynamic confidence threshold filtering strategy, and further includes:
[0222] For each unlabeled sample whose candidate pseudo-label has been retained, the processor checks where the sample's confidence ranks among all samples in the current training batch that are predicted as the same candidate class.
[0223] When the confidence ranks in the top K% of the candidate set of the same class, the processor confirms the corresponding candidate pseudo-label as a highly reliable pseudo-label.
[0224] Where K is a fixed-proportion hyperparameter.
[0225] Here, on top of the global dynamic confidence threshold screening, a relative confidence ranking check within each category is added, so that the global minimum confidence threshold and the Top-K ranking within the category form a dual screening mechanism. This further improves the reliability of pseudo-labels, alleviates the uneven distribution of samples across food categories, and also allows samples of long-tail rare food categories to be mined. The embodiments of this disclosure design a joint indicator function that serves as a distribution-aware dynamic filter, integrating the global dynamic confidence threshold constraint and the intra-category relative ranking constraint to achieve dual verification of pseudo-labels. The assignment rule for the candidate pseudo-label $\hat{y}_i$ is as follows:
[0226] $\hat{y}_i = \arg\max_j p_{i,j}$ if $p_i \ge \tau_t$ and $x_i \in \mathcal{R}_j^{K}$; otherwise the sample is discarded.
[0227] Where $\tau_t$ is the global minimum confidence threshold of the current round, $x_i \in \mathcal{R}_j^{K}$ indicates that the i-th sample's predicted probability ranks in the top K% of category j, and argmax returns the index at which its argument takes its largest value.
[0228] In detail, the unlabeled samples whose candidate pseudo-labels have passed the global minimum confidence threshold check and been retained are grouped according to the food category of their candidate pseudo-labels. All unlabeled samples predicted to belong to the same category form one candidate sample set for relative confidence comparison within that category. For each candidate sample set, the maximum predicted probability values of all samples in the set are sorted in descending order to obtain the confidence ranking for that category. A fixed-proportion hyperparameter K is preset. In each candidate sample set, only samples whose confidence ranks in the top K% are retained, and their candidate pseudo-labels are confirmed as highly reliable pseudo-labels. Samples not ranking in the top K% are discarded even if they meet the global confidence threshold and are excluded from subsequent self-training.
[0229] In this way, relative ranking within each category avoids the situation where no samples are available for a food category because the overall confidence level of that category is low. It also effectively suppresses label noise caused by low-confidence samples within the same category. While ensuring the overall reliability of pseudo-labels, it maintains a relative balance in the number of samples across categories, enabling the student network to learn common and rare food categories in a balanced manner during progressive self-training.
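The dual screening described above (global threshold first, then top K% within each predicted category) can be sketched as follows. The function name `dual_filter`, the tuple layout of the candidates, and the rounding rule (round up, keep at least one sample per non-empty category) are assumptions made for illustration; the text does not state how a fractional count is rounded.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple


def dual_filter(candidates: List[Tuple[int, int, float]], tau_t: float,
                top_percent: float) -> Dict[int, int]:
    """Return {sample_id: pseudo_label} for samples passing both checks.

    Each candidate is (sample_id, candidate_class, max_probability).
    """
    by_class = defaultdict(list)
    for sample_id, cls, p in candidates:
        if p >= tau_t:  # check 1: global minimum confidence threshold
            by_class[cls].append((p, sample_id))

    accepted: Dict[int, int] = {}
    for cls, members in by_class.items():
        members.sort(key=lambda m: m[0], reverse=True)  # descending confidence
        # check 2: keep the top K% of this category (assumed: round up, min 1)
        keep = max(1, math.ceil(len(members) * top_percent / 100.0))
        for _, sample_id in members[:keep]:
            accepted[sample_id] = cls
    return accepted


# Hypothetical batch: four confident samples of class 0, two candidates of class 1.
batch = [(0, 0, 0.95), (1, 0, 0.90), (2, 0, 0.85), (3, 0, 0.70),
         (4, 1, 0.65), (5, 1, 0.50)]
accepted = dual_filter(batch, tau_t=0.6, top_percent=50)
```

In this example the rarer class 1 still contributes one sample even though its confidences are lower than every retained class 0 sample, which is the balancing effect the ranking step is meant to provide.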
[0230] Optionally, in step S233, the processor inputs samples from the mixed training set into the fine-tuned joint training architecture for training, updating the student network parameters, including:
[0231] The processor trains the fine-tuned student network using samples from the mixed training set and recalculates the prototype clustering loss, classification loss, and distillation loss.
[0232] The processor performs a weighted summation of the recalculated prototype clustering loss, classification loss, and distillation loss to form a new total loss function, and calculates the gradient of the new total loss function with respect to the trainable weights of each layer of the student network.
[0233] The processor uses an adaptive optimizer to iteratively update the weight parameters of the feature extraction module, the dual Riemannian manifold processing module, and the classification layer of the fine-tuned student network.
[0234] Here, the samples from the mixed training set are re-input into the fine-tuned joint training architecture for progressive self-training and updating of the student network parameters, thereby enhancing the student network's target domain adaptation capability and continuously optimizing classification accuracy. This stage continues the dual-constraint optimization logic of the fine-tuning stage, recalculating various losses based on the mixed training set, constructing the total loss function, and completing the gradient update of the student network.
[0235] A mixed training set containing labeled samples from the target domain and unlabeled samples with highly reliable pseudo-labels is input into the fine-tuned joint training architecture, and a complete forward propagation is performed only through the student network. Samples are processed in sequence by the feature extraction module, the dual Riemannian manifold processing module, and the Riemannian metric Softmax classifier to output prediction results. The cross-entropy loss, prototype contrast loss, and distillation loss are recalculated and assigned preset hyperparameter weights, and a new total loss function for the self-training phase is constructed through weighted summation. Based on the newly constructed total loss function, the gradient of the total loss function with respect to the trainable weights of each layer of the student network is calculated using the backpropagation algorithm. Gradient calculation is performed only on the trainable parameters of the student network; the teacher network in the joint training architecture does not participate in gradient backpropagation and serves only as a reference benchmark for the source domain feature topology, ensuring that gradient updates affect only the target domain adaptation of the student network. An adaptive optimizer, such as Adaptive Moment Estimation (Adam), can be used to iteratively update all trainable weight parameters of the fine-tuned student network's feature extraction module, dual Riemannian manifold processing module, and classification layer based on the calculated gradient values. The adaptive optimizer dynamically adjusts the parameter update step size, which improves the student network's food classification accuracy in the target domain while maintaining the dual Riemannian manifold feature representation and the source domain feature topology, until the student network parameters converge smoothly.
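To illustrate how an adaptive optimizer such as Adam adjusts the update step size, the following sketch applies the standard Adam update rule to a single scalar weight. It is a toy example under stated assumptions: the function name `adam_step` is hypothetical, the quadratic loss merely stands in for the total loss function, and a real implementation would rely on a deep learning framework's optimizer applied to all student network weights.

```python
import math
from typing import Tuple


def adam_step(param: float, grad: float, m: float, v: float, step: int,
              lr: float = 0.001, beta1: float = 0.9, beta2: float = 0.999,
              eps: float = 1e-8) -> Tuple[float, float, float]:
    """One Adam update for a single weight; returns (new_param, new_m, new_v)."""
    m = beta1 * m + (1.0 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1.0 - beta2) * grad * grad   # running mean of squared gradients
    m_hat = m / (1.0 - beta1 ** step)             # bias correction (step starts at 1)
    v_hat = v / (1.0 - beta2 ** step)
    new_param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return new_param, m, v


# Toy stand-in for the total loss: L(w) = w**2, so dL/dw = 2 * w.
w, m, v = 1.0, 0.0, 0.0
w_after, m, v = adam_step(w, 2.0 * w, m, v, step=1, lr=0.1)
```

Because the gradient is divided by the square root of its running second moment, the first step has a magnitude close to the learning rate regardless of the raw gradient scale.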
[0236] Optionally, in step S233, smoothing the teacher network parameters using an exponential moving average (EMA) includes:
[0237] The processor inputs the parameters from the previous training round of the teacher network and the updated parameters from the current training round of the student network into the exponential moving average calculation formula to smoothly fuse the weights, thus obtaining the updated parameters of the teacher network for the current round.
[0238] The processor freezes the updated teacher network parameters as a benchmark for calculating topological constraint loss in the next round of training.
[0239] Here, the teacher network parameters are smoothly updated using the exponential moving average (EMA) method to suppress noise from erroneous pseudo-labels, maintain the stability of the teacher network parameters, and ensure the consistency of the topology constraints. Smooth parameter fusion avoids abrupt changes in the teacher network parameters, while freezing the updated teacher network parameters as the topology constraint benchmark ensures that the student network is always optimized against stable source domain topology characteristics.
[0240] The fixed parameters of the teacher network after the previous training round are simultaneously input into the formula for calculating the exponential moving average, along with the parameters of the student network updated by the gradient of the total loss function in the current round. This process smoothly merges the two types of parameters to obtain the updated parameters of the teacher network in the current round. The formula for calculating the exponential moving average is given above. The momentum coefficient α can be set to 0.999, which effectively filters out noisy gradients caused by false-label samples, ensuring smooth iteration of the teacher network parameters.
[0241] The parameters of the teacher network, obtained through exponential moving average smoothing and fusion, are weighted and frozen after the current round, preventing them from participating in gradient backpropagation in the current and next training rounds. These frozen teacher network parameters will serve as the sole benchmark for calculating the topological constraint loss in the next training iteration, used to extract stable feature topological relationships and compare their relative geometric relationships with those of the student network features. This constrains the student network to continuously retain the robust feature topological structure learned during pre-training in the source domain, preventing the topological constraints from failing due to drastic fluctuations in the teacher network parameters, and ensuring the stability and convergence of the cross-domain small-sample self-training process.
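The smoothing step described above can be sketched in a few lines. The sketch assumes the commonly used convex-combination form of the exponential moving average, in which the previous teacher parameters are weighted by the momentum coefficient α and the current student parameters by 1 - α. The function name `ema_update` and the dictionary-of-lists parameter layout are hypothetical.

```python
from typing import Dict, List


def ema_update(teacher: Dict[str, List[float]], student: Dict[str, List[float]],
               alpha: float = 0.999) -> Dict[str, List[float]]:
    """Blend previous teacher weights with current student weights.

    Assumed form: new_teacher = alpha * teacher + (1 - alpha) * student.
    The result is a new dictionary, so the inputs are left untouched and the
    returned teacher weights can be treated as frozen for the next round.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return {
        name: [alpha * t + (1.0 - alpha) * s
               for t, s in zip(teacher[name], student[name])]
        for name in teacher
    }


# Hypothetical two-weight layer; alpha = 0.9 is used only to make the effect visible.
teacher_prev = {"w": [1.0, 2.0]}
student_now = {"w": [0.0, 4.0]}
teacher_new = ema_update(teacher_prev, student_now, alpha=0.9)
```

With the value of 0.999 mentioned in the text, each round moves the teacher only one thousandth of the way towards the student, which is why isolated noisy updates have little effect on it.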
[0242] Optionally, the total loss function is $L_{total} = L_{CE} + \lambda_1 L_{proto} + \lambda_2 L_{KD}$.
[0243] Where $L_{CE}$ is the cross-entropy loss, $L_{proto}$ is the prototype contrast loss with weight $\lambda_1$, and $L_{KD}$ is the distillation loss with weight $\lambda_2$. The cross-entropy loss $L_{CE}$ directly determines the correctness of food classification; it is the core optimization objective and principal loss of the student network, so its weight is 1. The total loss function given above is used in both the fine-tuning and progressive self-training phases.
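As a small worked example of the weighted summation, the sketch below combines the three losses with the cross-entropy weight fixed at 1. The function name `total_loss` and all numeric values are hypothetical; the disclosure does not state concrete values for the two weights.

```python
def total_loss(l_ce: float, l_proto: float, l_kd: float,
               lambda_1: float, lambda_2: float) -> float:
    """Cross-entropy (weight 1) plus weighted prototype and distillation losses."""
    return l_ce + lambda_1 * l_proto + lambda_2 * l_kd


# Hypothetical per-batch loss values and weights, for illustration only.
example = total_loss(l_ce=0.9, l_proto=0.4, l_kd=0.2, lambda_1=0.5, lambda_2=1.0)
```

With these illustrative numbers the result is approximately 1.3, of which 0.9 comes from the cross-entropy term.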
[0244] With reference to Figure 9, optionally, after step S204, in which the processor, following progressive self-training to convergence, removes the teacher network from the joint training architecture and retains only the student network's forward propagation path, the method further includes:
[0245] S205, the processor inputs the samples from the calibration set into the retained student network to complete the forward propagation, statistically analyzes the numerical distribution characteristics of the weights and activation values of each convolutional layer and fully connected layer of the student network, and calculates the quantization scaling factor and zero-point drift value of each layer.
[0246] S206, the processor performs lossless quantization transformation on the weights and activation values of the retained student network based on the affine quantization mapping formula, combined with the quantization scaling factor and zero-point drift value, compressing the floating-point numbers of the weights and activation values into integers; thus obtaining a lightweight student network model.
[0247] The calibration set is composed of samples selected from the target domain dataset.
[0248] Here, lossless quantization based on the target domain data distribution is performed on the retained student network to further compress the model size and reduce inference computational consumption, enabling the final model to efficiently adapt to the low-storage, low-computing-power operating environment of embedded hardware at the edge of smart terminals. Specifically, representative food image samples with scene, category, and pose characteristics are selected from the target domain dataset to form a calibration set. This calibration set, derived from the target domain sample set, ensures that subsequent quantization parameters conform to the target domain features, avoiding large quantization errors introduced by differences in data distribution. The samples in the calibration set are input into the student network that retains only the forward propagation path, and a complete forward propagation is completed. During inference, the weight distribution of each convolutional layer and fully connected layer, as well as the activation value distribution characteristics of the forward output of each layer, are statistically analyzed, including the value range, extreme values, and distribution intervals. Based on the statistically obtained numerical distribution characteristics, the quantization scaling factor and zero-point drift value corresponding to each layer are calculated.
[0249] Based on a predefined affine quantization mapping formula, and combined with the independent quantization scaling factor S and zero-point drift value Z for each layer, lossless quantization transformation is performed on the retained student network weights and activation values, compressing the weights and activation values, originally stored as floating-point numbers, into integer data. The predefined affine quantization mapping formula is as follows:
[0250] $q = \mathrm{round}\left(\frac{r}{S}\right) + Z$.
[0251] Where q is the quantized value, r is the floating-point value, and round(·) is the rounding function. The scaling factor S and the zero-point drift value Z are not fixed hyperparameters; they are calculated by statistically analyzing the data distribution of each layer of the student network on the target domain calibration set (food image samples from real-world scenes on smart terminals).
[0252] This quantization process significantly reduces the model's storage space and the amount of floating-point operations at the edge, while ensuring that the accuracy of food identification is basically unaffected. This allows the lightweight student network model to be directly deployed on embedded chips in smart terminals, enabling stable and fast localized food identification.
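The calibration and affine mapping steps can be sketched as follows. The sketch assumes a signed 8-bit integer range and a simple min/max calibration over the observed values, and it clamps results to that range; none of these choices is stated in the disclosure. The function names `calibrate`, `quantize`, and `dequantize` are hypothetical.

```python
from typing import List, Tuple


def calibrate(values: List[float], qmin: int = -128,
              qmax: int = 127) -> Tuple[float, int]:
    """Derive (scale S, zero point Z) from observed values (assumed min/max rule)."""
    lo = min(min(values), 0.0)   # include 0.0 so that zero is exactly representable
    hi = max(max(values), 0.0)
    if hi == lo:
        return 1.0, 0
    scale = (hi - lo) / (qmax - qmin)
    zero_point = int(round(qmin - lo / scale))
    return scale, max(qmin, min(qmax, zero_point))


def quantize(r: float, scale: float, zero_point: int,
             qmin: int = -128, qmax: int = 127) -> int:
    """Affine mapping q = round(r / S) + Z, clamped to the integer range."""
    q = int(round(r / scale)) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q: int, scale: float, zero_point: int) -> float:
    """Approximate inverse mapping r ~ S * (q - Z)."""
    return scale * (q - zero_point)


# Hypothetical activation values collected for one layer on the calibration set.
layer_values = [-1.0, 0.0, 0.5, 3.0]
S, Z = calibrate(layer_values)
codes = [quantize(r, S, Z) for r in layer_values]
restored = [dequantize(q, S, Z) for q in codes]
```

Comparing `restored` with `layer_values` shows a per-value error of at most one quantization step S, which gives a concrete sense of how closely the integer representation tracks the original floating-point values.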
[0253] As shown in Figure 10, this disclosure provides a cross-domain few-sample object image classification device 100 for a smart terminal, including a processor 101 and a memory 102. Optionally, the device may further include a communication interface 103 and a bus 104. The processor 101, communication interface 103, and memory 102 can communicate with each other via the bus 104. The communication interface 103 can be used for information transmission. The processor 101 can call logical instructions in the memory 102 to execute the cross-domain few-sample object image classification method for a smart terminal described in the above embodiment.
[0254] Furthermore, the logical instructions in the aforementioned memory 102 can be implemented as software functional units and, when sold or used as independent products, can be stored in a computer-readable storage medium.
[0255] The memory 102, as a computer-readable storage medium, can be used to store software programs and computer-executable programs, such as program instructions / modules corresponding to the methods in the embodiments of this disclosure. The processor 101 executes functional applications and data processing by running the program instructions / modules stored in the memory 102, that is, it implements the cross-domain small sample object image classification method for smart terminals in the above embodiments.
[0256] The memory 102 may include a program storage area and a data storage area. The program storage area may store the operating system and applications required for at least one function; the data storage area may store data created based on the use of the terminal device. Furthermore, the memory 102 may include high-speed random access memory and may also include non-volatile memory.
[0257] As shown in Figure 11, this disclosure provides a smart terminal 200, including: a smart terminal body with a built-in image acquisition device for acquiring object images; and the aforementioned cross-domain small-sample object image classification device 100 for the smart terminal. The cross-domain small-sample object image classification device 100 for the smart terminal is installed in the smart terminal 200. The installation relationship described herein is not limited to placement within the smart terminal body, but also includes installation connections with other components of the smart terminal 200, including but not limited to physical connections, electrical connections, or signal transmission connections. Those skilled in the art will understand that the cross-domain small-sample object image classification device 100 for the smart terminal can be adapted to any feasible smart terminal body, thereby realizing other feasible embodiments. Optionally, the smart terminal includes a smart refrigerator, a smart freezer, or a smart preservation cabinet.
[0258] This disclosure provides a computer-readable storage medium storing computer-executable instructions configured to perform the above-described cross-domain small sample object image classification method for smart terminals.
[0259] The technical solutions of this disclosure can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes one or more instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the method described in this disclosure. The aforementioned storage medium can be a non-transitory storage medium, such as a USB flash drive, external hard drive, read-only memory (ROM), random access memory (RAM), magnetic disk, or optical disk, or other media capable of storing program code.
[0260] The foregoing description and accompanying drawings fully illustrate embodiments of this disclosure to enable those skilled in the art to practice them. Other embodiments may include structural, logical, electrical, procedural, and other changes. The embodiments represent only possible variations. Individual components and functions are optional unless explicitly required, and the order of operation may vary. Parts and features of some embodiments may be included in or replace parts and features of other embodiments. Moreover, the terminology used in this application is for describing embodiments only and is not intended to limit the claims. As used in the description of embodiments and claims, the singular forms "a," "an," and "the" are intended to equally include the plural forms unless the context clearly indicates otherwise. Similarly, the term "and / or" as used in this application means including one or more of the associated listed items and all possible combinations thereof. Additionally, when used in this application, the term "comprise" and its variations "comprises" and / or "comprising" refer to the presence of stated features, integers, steps, operations, elements, and / or components, but do not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and / or groups thereof. Without further limitations, an element defined by the phrase "comprises a..." does not exclude the presence of other identical elements in the process, method, or apparatus that includes said element. In this document, each embodiment may focus on its differences from other embodiments, and similar or identical parts of the embodiments may be cross-referenced. For methods, products, etc., disclosed in the embodiments, if they correspond to the method section disclosed in the embodiments, reference may be made to the description of the method section for the relevant parts.
[0261] Those skilled in the art will recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of the embodiments of this disclosure. Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the specific working processes of the systems, devices, and units described above can be referred to the corresponding processes in the foregoing method embodiments, and will not be repeated here.
[0262] The methods and products (including but not limited to devices and equipment) disclosed in the embodiments herein can be implemented in other ways. For example, the device embodiments described above are merely illustrative. For instance, the division of units may be merely a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the coupling or direct coupling or communication connection shown or discussed between each other may be through some interfaces, and the indirect coupling or communication connection between devices or units may be electrical, mechanical, or other forms. The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units, that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected to implement this embodiment according to actual needs. In addition, the functional units in the embodiments of this disclosure may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
[0263] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to embodiments of this disclosure. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of code containing one or more executable instructions for implementing a specified logical function. In some alternative implementations, the functions marked in the blocks may occur in a different order than that shown in the drawings. For example, two consecutive blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. In the descriptions corresponding to the flowcharts and block diagrams in the accompanying drawings, the operations or steps corresponding to different blocks may also occur in a different order than disclosed in the description, and sometimes there is no specific order between different operations or steps. For example, two consecutive operations or steps may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. Each block in a block diagram and / or flowchart, and combinations of blocks in a block diagram and / or flowchart, can be implemented using a dedicated hardware-based system that performs the specified function or action, or using a combination of dedicated hardware and computer instructions.
Claims
1. A cross-domain few-sample object image classification method for smart terminals, characterized in that it comprises: acquiring an image of the object to be identified; processing the image of the object to be identified using a feature extraction module to extract a basic feature vector; processing and fusing the basic feature vector using a dual Riemannian manifold processing module to generate a comprehensive feature representation, wherein the basic feature vector is mapped by the Euclidean space branch and the hyperbolic space branch of the dual Riemannian manifold processing module respectively, and the mapped features are fused; the Euclidean space features have translation invariance and can clearly define the global decision boundary of object categories, and the hyperbolic space features can embed the tree-like hierarchical structure of objects with low distortion; and, based on the comprehensive feature representation, outputting the object classification and recognition result through classification layer processing; wherein a lightweight student network model comprises the feature extraction module, the dual Riemannian manifold processing module, and the classification layer, and the lightweight student network model is obtained by joint training with topology-preserving relational knowledge distillation and progressive self-training, followed by removal of the teacher network; the lightweight student network model is obtained as follows: a constructed joint training architecture, which includes a teacher network and a student network, is initialized using the source domain sample set; based on metric learning, momentum prototype alignment and topology-preserving knowledge distillation are used to fine-tune the initialized joint training architecture to convergence using labeled samples from the target domain sample set.
Specifically, based on labeled samples from the target domain, the prototype centers of the dual Riemannian manifold processing module are updated through a momentum prototype alignment mechanism to obtain a stable dual Riemannian manifold processing module; based on the prototype centers of the stable dual Riemannian manifold processing module, a cross-entropy loss and a prototype contrast loss are constructed; a distillation loss is constructed based on topology-preserving knowledge distillation; the cross-entropy loss, the prototype contrast loss, and the distillation loss are weighted and summed to obtain the total loss function; based on the total loss function, the backpropagation algorithm is executed to calculate the joint gradient with respect to the weights of each layer of the student network so as to update the student network; wherein the cross-entropy loss is constructed by: calculating the Euclidean L2 distance and the hyperbolic geodesic distance between the dual-space features of a sample and the corresponding class prototype centers, respectively; converting, by the Riemannian metric prototype Softmax classifier, the Euclidean L2 distance and the hyperbolic geodesic distance into classification prediction probabilities through a negative exponential distribution; and calculating the cross-entropy from the prediction probabilities and the true labels of the samples to obtain the cross-entropy loss; the dual-space features include Euclidean space features and hyperbolic space features; the fine-tuned joint training architecture is progressively self-trained using the target domain sample set; and after progressive self-training to convergence, the teacher network in the joint training architecture is removed, and only the forward propagation path of the student network is retained to obtain the lightweight student network model.
2. The method according to claim 1, characterized in that constructing the distillation loss based on topology-preserving knowledge distillation comprises: inputting the current training batch samples into the student network to obtain a student feature set, and inputting the current training batch samples into the teacher network to obtain a teacher feature set; based on the teacher feature set and the student feature set, calculating the Euclidean space distance and the hyperbolic geodesic distance of any sample pair respectively to construct a distance distillation loss; based on the teacher feature set and the student feature set, calculating the cosine similarity between the unit features of any sample pair to construct an angular distillation loss; and summing the distance distillation loss and the angular distillation loss to obtain the distillation loss.
3. The method according to claim 2, characterized in that calculating, based on the teacher feature set and the student feature set, the Euclidean distance and the hyperbolic geodesic distance of any sample pair respectively to construct the distance distillation loss includes: calculating the Euclidean distance and the hyperbolic geodesic distance between any pair of samples in the teacher feature set and in the student feature set, respectively; normalizing the calculated Euclidean distances and hyperbolic geodesic distances by their mean to obtain normalized Euclidean relative distances and hyperbolic geodesic relative distances; and using the Huber loss function to constrain the consistency of the distributions of the normalized Euclidean relative distances and hyperbolic geodesic relative distances, thereby constructing the distance distillation loss.
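As a rough illustration of claim 3, the following NumPy sketch computes pairwise distances within the teacher batch and within the student batch, divides each distance matrix by its mean off-diagonal value, and penalizes teacher/student differences with a Huber loss. Reading the "consistency" constraint as a per-space teacher-versus-student comparison is an interpretation, and the Poincare ball model, the Huber `delta`, and the function names are assumptions, not details taken from the claims.

```python
import numpy as np

def pairwise_euclidean(f):
    """(N, N) matrix of Euclidean distances between rows of f."""
    diff = f[:, None, :] - f[None, :, :]
    return np.sqrt(np.sum(diff ** 2, axis=-1))

def pairwise_poincare(f, eps=1e-7):
    """(N, N) matrix of geodesic distances on the unit Poincare ball (assumed model)."""
    sq = np.clip(np.sum(f ** 2, axis=-1), 0.0, 1.0 - eps)
    sq_diff = np.sum((f[:, None, :] - f[None, :, :]) ** 2, axis=-1)
    arg = 1.0 + 2.0 * sq_diff / ((1.0 - sq)[:, None] * (1.0 - sq)[None, :])
    return np.arccosh(np.maximum(arg, 1.0))

def mean_normalize(d, eps=1e-12):
    """Divide by the mean off-diagonal distance to obtain relative distances."""
    off_diag = ~np.eye(d.shape[0], dtype=bool)
    return d / (d[off_diag].mean() + eps)

def huber(x, delta=1.0):
    """Elementwise Huber penalty: quadratic near zero, linear in the tails."""
    a = np.abs(x)
    return np.where(a <= delta, 0.5 * a ** 2, delta * (a - 0.5 * delta))

def distance_distillation_loss(t_euc, s_euc, t_hyp, s_hyp):
    """Huber loss between teacher and student relative distances in both spaces."""
    off_diag = ~np.eye(t_euc.shape[0], dtype=bool)
    diff_e = mean_normalize(pairwise_euclidean(s_euc)) - mean_normalize(pairwise_euclidean(t_euc))
    diff_h = mean_normalize(pairwise_poincare(s_hyp)) - mean_normalize(pairwise_poincare(t_hyp))
    return huber(diff_e)[off_diag].mean() + huber(diff_h)[off_diag].mean()
```

Because of the mean normalization, the Euclidean term in this sketch is unchanged when the student features are a uniformly scaled copy of the teacher features; only the relative geometry of the batch is compared.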
4. The method according to claim 2, characterized in that calculating, based on the teacher feature set and the student feature set, the cosine similarity between the unit features of any sample pair to construct the angular distillation loss includes: normalizing the dual-space features in the teacher feature set and in the student feature set to obtain unit dual-space features; calculating the cosine similarity between the unit dual-space features of any pair of samples in the teacher feature set and in the student feature set, respectively; and using the mean square error loss function to constrain the consistency of the distributions of the cosine similarities, thereby constructing the angular distillation loss; wherein the dual-space features include Euclidean space features and hyperbolic space features.
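The angular term of claim 4 can be sketched in the same spirit: features are L2-normalized to unit length, a pairwise cosine-similarity matrix is built for teacher and student, and the two matrices are compared with a mean square error. Treating the Euclidean and hyperbolic features as two separate matrices whose losses are added is an assumption, as are the function names.

```python
import numpy as np

def cosine_matrix(f, eps=1e-12):
    """Pairwise cosine similarities between rows of f after unit normalization."""
    unit = f / (np.linalg.norm(f, axis=1, keepdims=True) + eps)
    return unit @ unit.T

def angular_distillation_loss(t_euc, s_euc, t_hyp, s_hyp):
    """MSE between teacher and student cosine-similarity matrices in both spaces."""
    loss_e = np.mean((cosine_matrix(s_euc) - cosine_matrix(t_euc)) ** 2)
    loss_h = np.mean((cosine_matrix(s_hyp) - cosine_matrix(t_hyp)) ** 2)
    return loss_e + loss_h
```

In this sketch the loss is zero whenever the student features are a rotated or rescaled copy of the teacher features, since such transformations leave the pairwise angles unchanged. Per claim 2, this term is summed with the distance distillation loss to give the distillation loss.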
5. The method according to claim 1, characterized in that progressively self-training the fine-tuned joint training architecture using the target domain sample set includes: in each training iteration, using the teacher network to perform inference on the unlabeled samples in the target domain sample set to generate candidate pseudo-labels; verifying the candidate pseudo-labels based on a dynamic confidence threshold screening strategy to obtain highly reliable pseudo-labels; mixing the unlabeled samples that have highly reliable pseudo-labels with the labeled samples to form a mixed training set; and inputting the samples of the mixed training set into the fine-tuned joint training architecture for training, updating the student network parameters, and smoothing the teacher network parameters by an exponential moving average, until the updated joint training architecture converges on the target domain sample set.
6. The method according to claim 5, characterized in that using the teacher network to perform inference on the unlabeled samples in the target domain sample set to generate candidate pseudo-labels includes: using the teacher network to infer a classification prediction probability distribution for each unlabeled sample in the target domain sample set; and extracting the maximum probability and its corresponding category from the classification prediction probability distribution, and using the category corresponding to the maximum probability as the candidate pseudo-label of the unlabeled sample.
7. The method according to claim 6, characterized in that verifying the candidate pseudo-labels based on the dynamic confidence threshold screening strategy includes: for any unlabeled sample, obtaining the maximum probability value p_i in the corresponding predicted probability distribution; and when p_i is greater than or equal to the global minimum confidence threshold of the current training round t, retaining the candidate pseudo-label of the sample; wherein the global minimum confidence threshold increases non-linearly with the number of iterations, and t denotes the current training round.
8. The method according to claim 7, characterized in that verifying the candidate pseudo-labels based on the dynamic confidence threshold screening strategy further includes: for each unlabeled sample whose candidate pseudo-label has been retained, checking the relative ranking of its confidence among all samples predicted as the same candidate class in the current training batch; and when the confidence ranks in the top K% of the same-class candidate set, confirming the corresponding candidate pseudo-label as a highly reliable pseudo-label; wherein K is a fixed-proportion hyperparameter.
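The two-stage screening of claims 6 to 8 can be sketched as follows. The claims state only that the global threshold increases non-linearly with the training round; the particular schedule in `confidence_threshold`, its default values, and the default `top_k_percent` are hypothetical choices made for this example, as are the function names.

```python
import numpy as np

def confidence_threshold(t, total_rounds, tau_min=0.7, tau_max=0.95, gamma=2.0):
    """Hypothetical schedule rising non-linearly from tau_min to tau_max over training."""
    progress = min(max(t / total_rounds, 0.0), 1.0)
    return tau_min + (tau_max - tau_min) * (1.0 - (1.0 - progress) ** gamma)

def select_pseudo_labels(probs, t, total_rounds, top_k_percent=50.0):
    """Return candidate labels and a boolean mask of highly reliable pseudo-labels.

    probs: (N, C) teacher prediction probabilities for one unlabeled batch.
    """
    conf = probs.max(axis=1)     # maximum probability p_i per sample
    cand = probs.argmax(axis=1)  # candidate pseudo-label per sample
    # Stage 1: global minimum confidence threshold for round t.
    keep = conf >= confidence_threshold(t, total_rounds)
    # Stage 2: rank within all samples predicted as the same class in the batch.
    in_top_k = np.zeros_like(keep)
    for c in np.unique(cand):
        members = np.flatnonzero(cand == c)
        n_top = max(1, int(np.ceil(len(members) * top_k_percent / 100.0)))
        in_top_k[members[np.argsort(-conf[members])[:n_top]]] = True
    return cand, keep & in_top_k
```

With this schedule, early rounds admit more pseudo-labels and later rounds admit only the most confident ones, while the per-class ranking keeps a single easy class from dominating the mixed training set.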
9. The method according to claim 5, characterized in that inputting the samples of the mixed training set into the fine-tuned joint training architecture for training and updating the student network parameters includes: training the fine-tuned student network with the samples of the mixed training set, and recalculating the cross-entropy loss, the prototype contrast loss, and the distillation loss; constructing a new total loss function by weighted summation of the recalculated cross-entropy loss, prototype contrast loss, and distillation loss; calculating the gradient of the new total loss function with respect to the trainable weights of each layer of the student network; and using an adaptive optimizer to iteratively update the weight parameters of the feature extraction module, the dual Riemannian manifold processing module, and the classification layer of the fine-tuned student network.
10. The method according to claim 9, characterized in that smoothing the teacher network parameters by an exponential moving average includes: inputting the parameters of the teacher network from the previous training round and the updated parameters of the student network from the current training round into the exponential moving average formula for smooth fusion of the weights, to obtain the updated parameters of the teacher network for the current round; and freezing the updated teacher network parameters and using them as the reference for calculating the distillation loss in the next round of training.
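The exponential moving average update of claim 10 can be sketched as below. The parameter dictionaries, the function name, and the decay value are assumptions for illustration; the claims do not state a specific decay.

```python
import numpy as np

def ema_update(teacher_params, student_params, decay=0.99):
    """Blend previous teacher weights with current student weights.

    teacher_new = decay * teacher_old + (1 - decay) * student_current
    Both arguments map parameter names to NumPy arrays of matching shape.
    """
    return {
        name: decay * teacher_params[name] + (1.0 - decay) * student_params[name]
        for name in teacher_params
    }
```

A decay close to 1 makes the teacher change slowly, so the reference used for the distillation loss in the next round is a smoothed history of the student rather than its latest, noisier state.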
11. The method according to claim 1, characterized in that, after removing the teacher network from the joint training architecture and retaining only the forward propagation path of the student network, the method further includes: inputting the samples of a calibration set into the retained student network to complete forward propagation; collecting statistics on the numerical distributions of the weights and activation values of each convolutional layer and fully connected layer of the student network, and calculating the quantization scaling factor and the zero-point offset of each layer; and, based on the affine quantization mapping formula and using the quantization scaling factor and the zero-point offset, performing lossless quantization transformation on the weights and activation values of the retained student network, so that the floating-point weights and activation values are compressed into integers; wherein the calibration set is composed of samples selected from the target domain dataset.
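The affine quantization mapping of claim 11 is commonly written as q = round(x / scale) + zero_point. The sketch below assumes unsigned 8-bit integers and a simple min/max calibration over the observed values; the bit width, the calibration rule, and the function names are assumptions, since the claims do not specify them.

```python
import numpy as np

QMIN, QMAX = 0, 255  # assumed unsigned 8-bit integer range

def calibrate(values):
    """Derive a scale and zero point from the observed min/max of a tensor."""
    lo = min(float(values.min()), 0.0)  # make sure 0.0 is representable
    hi = max(float(values.max()), 0.0)
    scale = (hi - lo) / (QMAX - QMIN) if hi > lo else 1.0
    zero_point = int(round(QMIN - lo / scale))
    return scale, max(QMIN, min(QMAX, zero_point))

def quantize(x, scale, zero_point):
    """Affine mapping from floating point to uint8."""
    q = np.round(x / scale) + zero_point
    return np.clip(q, QMIN, QMAX).astype(np.uint8)

def dequantize(q, scale, zero_point):
    """Inverse affine mapping back to floating point."""
    return scale * (q.astype(np.float32) - zero_point)
```

In this sketch, values inside the calibrated range round-trip to within half a quantization step, and 0.0 maps exactly onto the zero point, which is why the calibration range is forced to include zero.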
12. A cross-domain small-sample object image classification device for a smart terminal, comprising a processor and a memory storing program instructions, characterized in that the processor is configured to execute, when running the program instructions, the cross-domain small-sample object image classification method for a smart terminal according to any one of claims 1 to 11.