A streaming image recognition method for smart terminals based on hierarchical semantic anchoring

By constructing a hierarchical semantic tree on a smart terminal and applying hyperbolic-space geometric constraints together with null-space gradient projection, the method addresses fine-grained feature drift and forgetting in streaming image recognition on smart terminals, thereby improving recognition accuracy and stability.

CN122289631A · Pending · Publication Date: 2026-06-26 · NANJING UNIV +2

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
NANJING UNIV
Filing Date
2026-04-01
Publication Date
2026-06-26


Abstract

This invention discloses a streaming image recognition method for smart terminals based on hierarchical semantic anchoring. First, streaming image data is acquired from the smart terminal. A hierarchical semantic tree of the categories to be recognized is constructed using a large language model to obtain the parent-child hierarchical relationships between categories. A recognition model including a visual encoder and a text encoder is constructed, and a hyperbolic mapping module is introduced to map features from Euclidean space to hyperbolic space. During training, a cone constraint is applied in hyperbolic space using the hierarchical semantic tree to ensure that subclass features are contained within the geometric cone of the parent class, thus anchoring semantic features and preventing drift. During incremental task learning, the update gradient of the model parameters is projected onto the null space of the old task features. Finally, during the inference stage, the prediction results from Euclidean space and hyperbolic space are combined to output the final classification. This invention effectively solves the catastrophic forgetting problem caused by fine-grained feature drift in streaming learning, and improves the recognition accuracy and stability of the model in open and dynamic scenes by utilizing a hierarchical structure.

Description

Technical Field

[0001] This invention relates to a streaming image recognition method for smart terminals based on hierarchical semantic anchoring, belonging to the field of computer vision and artificial intelligence technology.

Background Technology

[0002] With the widespread adoption of mobile smart terminals (such as smartphones, AR glasses, and home service robots), these devices are constantly exposed to new image data during daily operation. For example, photo album applications need to identify newly photographed pets, products, or scenic spots. This data arrives in a streaming manner, and the categories dynamically increase over time. Faced with massive amounts of streaming data, traditional deep learning methods typically require saving all historical data and retraining the model, which is impractical on mobile devices with limited computing resources and extremely high privacy requirements. The computing power bottleneck and privacy demands of mobile smart terminals further amplify the shortcomings of traditional methods. On the one hand, the processor performance and memory capacity of these devices are far inferior to those of cloud servers, and retraining with full data will cause prolonged device lag and soaring power consumption, severely impacting the user experience. On the other hand, image data such as pet photos and shopping receipts taken by users often contain sensitive information, and uploading them to the cloud for training poses a risk of privacy leakage. Therefore, edge-side incremental learning technology that does not rely on historical data and cloud computing power has become a core research direction in the field of mobile visual recognition.

[0003] In recent years, class-incremental learning methods based on pre-trained models (such as CLIP) have attracted attention; they aim to let a model learn new categories without forgetting old knowledge. However, most existing methods fine-tune features in a flat Euclidean space and ignore the hierarchical semantic relationships that naturally exist between real-world object categories (e.g., "Husky" and "Golden Retriever" both belong to "dog"). In streaming learning, this lack of hierarchical structure makes fine-grained subclass features prone to drift: the features of newly learned categories encroach on, or become confused with, the feature space of old categories. Hierarchical recognition would first identify the parent class "dog" and then distinguish similar subclasses such as "Husky" and "Golden Retriever" by features such as fur and body shape. Existing incremental learning models with a flat spatial structure, however, treat all categories as isolated individuals. When learning "Husky", they cannot use the general features of "dog" as an associative constraint, so the learning of new category features stays at the level of superficial matching. The new features not only overlap easily with the feature space of similar subclasses such as "Golden Retriever", but also gradually overwrite the core features of the parent class "dog", ultimately leading to the absurd situation of "being able to recognize a Husky but forgetting what a dog is". This lack of hierarchical structure ultimately results in severe catastrophic forgetting and degrades the recognition accuracy of smart terminals in open and dynamic environments.

[0004] Therefore, there is an urgent need for a streaming image recognition method for smart terminals based on hierarchical semantic anchoring that can improve recognition accuracy and stability.

Summary of the Invention

[0005] The summary section of this application is intended to provide a brief overview of the concepts, which will be described in detail in the detailed description section below. This summary section is not intended to identify key or essential features of the claimed technical solutions, nor is it intended to limit the scope of the claimed technical solutions.

[0006] To address the problems and shortcomings of the existing technology, namely that existing streaming image recognition methods neglect hierarchical relationships and are prone to feature drift that leads to catastrophic forgetting, this invention provides a streaming image recognition method for smart terminals based on hierarchical semantic anchoring. The method uses a large language model to construct a hierarchical semantic tree and anchors features through geometric constraints in hyperbolic space. It also incorporates null-space gradient projection to achieve efficient, accurate, and forgetting-resistant continual learning on mobile devices, thereby solving the problems described in the background.

[0007] To achieve the above objectives, the present invention provides the following technical solution. As a first aspect of this application, the present invention discloses a streaming image recognition method for smart terminals based on hierarchical semantic anchoring, comprising the following steps:
Step 1: During operation of the smart terminal, continuously acquire streaming image data and the corresponding category labels, and preprocess them to obtain the base task dataset and the incremental task dataset;
Step 2: Load the pre-trained vision-language model, initialize the task-specific hierarchical perception module and the globally shared hyperbolic mapping layer, and invoke a large language model to construct a hierarchical semantic tree from the category names of the current task;
Step 3: Input the base task dataset and the incremental task dataset into the vision-language model for training. During training, the hyperbolic mapping layer maps the aggregated Euclidean-space features to hyperbolic space; based on the hierarchical semantic tree, cone geometric constraints anchor subclass features within the cone region of their parent class features, and the hyperbolic contrastive loss and the hierarchical entailment loss are calculated;
Step 4: When updating the model parameters, calculate the null space of the old task features, project the gradient of the new task onto this null space, and save the model parameters;
Step 5: Obtain the image data to be recognized, input it into the trained vision-language model, integrate the predictions from Euclidean space and hyperbolic space to obtain the prediction result, and feed it back to the user.

[0008] Preferably, in step 1, a certain amount of image data and the corresponding category labels are first collected on the smart terminal and preprocessed to obtain the base task dataset. Newly arriving image data and the corresponding category labels are then collected continuously and preprocessed to obtain the incremental task dataset. The preprocessing includes operations such as resizing, normalizing, and augmenting the collected image data.

[0009] Preferably, step 3 further includes the following steps:
Step 3.1: For the current task, freeze the hierarchical perception modules of the old tasks and use the optimizer to train only the hierarchical perception module of the current task;
Step 3.2: Input the images of the current task into the frozen visual encoder to obtain visual features, input the visual features into the hierarchical perception modules, and add the resulting features together to obtain the aggregated Euclidean-space features;
Step 3.3: Input the aggregated Euclidean-space features into the hyperbolic mapping layer, which maps them to hyperbolic space to yield hyperbolic embedding vectors;
Step 3.4: Based on the hierarchical semantic tree, constrain the hyperbolic embedding vector of each subclass node to lie within the cone region defined by its parent class node, and calculate the hierarchical entailment loss in hyperbolic space;
Step 3.5: Simultaneously calculate the hyperbolic contrastive loss, which pulls matching image-text pairs closer together and pushes mismatched image-text pairs apart.

[0010] Preferably, in step 3, the hierarchical semantic tree is constructed by the aforementioned large language model, which takes the category labels as input and automatically infers the hierarchical relationships between categories, generating tree-structured data containing a root node, abstract parent class nodes, and child class nodes. The cone region refers to the angular range defined by the parent class node in hyperbolic geometry. The hierarchical entailment loss constrains the hyperbolic embedding vector of each subclass node to lie within the cone region defined by its parent node, forcing the subclass semantics to be contained within the parent semantics and thereby anchoring features and preventing drift.

[0011] Preferably, step 4 further includes the following steps:
Step 4.1: Calculate the uncentered covariance matrix of the Euclidean features that are input to the hyperbolic mapping layer, and perform singular value decomposition on it;
Step 4.2: Select the smallest index at which the cumulative energy exceeds a set threshold, and take the resulting orthogonal complement matrix as the null space of the old task features;
Step 4.3: Project the original update gradient of the hyperbolic mapping layer parameters onto this null space, so that only the gradient components orthogonal to the variation of the old task features are retained for the parameter update.
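Steps 4.1 to 4.3 can be written compactly as below. This is a sketch in assumed notation, because the symbols themselves are not reproduced in this text: X stacks the n old-task Euclidean features row-wise (d columns), G is the gradient of the hyperbolic mapping layer weight, and ε is the energy threshold. All of these names are illustrative.

```latex
% Assumed notation: X in R^{n x d}, gradient G in R^{m x d}, threshold 0 < \epsilon < 1.
\begin{align}
C &= \tfrac{1}{n} X^{\top} X = U \Sigma V^{\top},
  \qquad \Sigma = \operatorname{diag}(\sigma_1 \ge \dots \ge \sigma_d) \\
k &= \min\Bigl\{ m \;:\; \textstyle\sum_{i \le m} \sigma_i \Big/ \sum_{i \le d} \sigma_i \ge \epsilon \Bigr\},
  \qquad \hat{U} = \bigl[\, u_{k+1}, \dots, u_d \,\bigr] \\
G' &= G \, \hat{U} \hat{U}^{\top}
\end{align}
```

Under these assumptions, old features x satisfy \hat{U}^{\top} x \approx 0, so an update along G' leaves the layer output on old inputs almost unchanged.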

[0012] Preferably, step 5 further includes the following steps:
Step 5.1: Obtain the image data to be classified, preprocess it, and input it into the trained and updated vision-language model;
Step 5.2: Calculate the similarity prediction probability based on cosine similarity in Euclidean space, and the negative hyperbolic prediction probability based on negative hyperbolic distance in hyperbolic space;
Step 5.3: Integrate the similarity prediction probability and the negative hyperbolic prediction probability to obtain the prediction result;
Step 5.4: Select the category with the highest probability in the prediction result as the final classification result and feed it back to the user.

[0013] Preferably, the hierarchical entailment loss in step 3.4 and the hyperbolic contrastive loss in step 3.5 are expressed in terms of the following quantities: the cone half-aperture of the parent node; the exterior angle by which a subclass node deviates from the cone boundary; the image-to-text matching probability; the text-to-image matching probability; the hyperbolic visual feature and the hyperbolic text feature of each positive image-text pair in the current batch; the hyperbolic distance; the total number of samples in the batch; and a temperature parameter that controls the smoothness of the distribution.
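One common Lorentz-model formulation that is consistent with the quantities listed above is sketched below. It is an assumption, not the formulas of this application, which are not reproduced in this text. The symbols are illustrative: c > 0 is the curvature magnitude, K a constant, p and q a parent and a child embedding, v_i and t_i the hyperbolic visual and text features of the i-th positive pair, B the batch size, and τ the temperature. The use of softplus as the smooth positive function is also an assumption.

```latex
% Assumed Lorentz-model notation: x = (x_{space}, x_{time}).
\begin{align}
\langle x, y \rangle_{\mathcal{L}} &= \langle x_{space}, y_{space} \rangle - x_{time}\, y_{time},
  \qquad x_{time} = \sqrt{1/c + \lVert x_{space} \rVert^{2}} \\
d(x, y) &= \tfrac{1}{\sqrt{c}} \cosh^{-1}\!\bigl( -c \, \langle x, y \rangle_{\mathcal{L}} \bigr) \\
\operatorname{aper}(p) &= \sin^{-1}\!\Bigl( \frac{2K}{\sqrt{c}\, \lVert p_{space} \rVert} \Bigr) \\
\operatorname{ext}(p, q) &= \cos^{-1}\!\Bigl( \frac{ q_{time} + p_{time}\, c \, \langle p, q \rangle_{\mathcal{L}} }
  { \lVert p_{space} \rVert \sqrt{ (c \, \langle p, q \rangle_{\mathcal{L}})^{2} - 1 } } \Bigr) \\
\mathcal{L}_{ent} &= \operatorname{softplus}\bigl( \operatorname{ext}(p, q) - \operatorname{aper}(p) \bigr) \\
p^{i2t}_{i} &= \frac{ \exp(-d(v_i, t_i)/\tau) }{ \sum_{j=1}^{B} \exp(-d(v_i, t_j)/\tau) },
  \qquad
  p^{t2i}_{i} = \frac{ \exp(-d(v_i, t_i)/\tau) }{ \sum_{j=1}^{B} \exp(-d(v_j, t_i)/\tau) } \\
\mathcal{L}_{con} &= -\frac{1}{2B} \sum_{i=1}^{B} \bigl( \log p^{i2t}_{i} + \log p^{t2i}_{i} \bigr)
\end{align}
```

In this form the entailment term is small when the child lies inside the parent's cone (exterior angle below the half-aperture) and grows smoothly as the child moves outside it.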

[0014] Preferably, the similarity prediction probability and the negative hyperbolic prediction probability in step 5.2 are expressed in terms of the following quantities: the image feature; the category center; the negative hyperbolic distance; the hyperbolic feature of the image; and the hyperbolic feature of the text prototype.

[0015] As a second aspect of this application, the present invention also discloses an electronic device, comprising: at least one processor, and a memory communicatively connected to the at least one processor. The memory stores instructions executable by the at least one processor which, when executed by the at least one processor, enable the at least one processor to perform the steps of the streaming image recognition method for smart terminals based on hierarchical semantic anchoring described above.

[0016] As a third aspect of this application, the present invention also discloses a computer storage medium storing a computer program thereon, characterized in that the computer program, when executed by a processor, implements the steps of the above-described streaming image recognition method for intelligent terminals based on hierarchical semantic anchoring.

[0017] Compared with the prior art, the beneficial effects of the present invention are as follows. This invention discloses a streaming image recognition method for smart terminals based on hierarchical semantic anchoring. First, during operation of the smart terminal, streaming image data and the corresponding category labels are continuously acquired and preprocessed to obtain a base task dataset and an incremental task dataset. A pre-trained vision-language model is loaded, and a task-specific hierarchical perception module and a globally shared hyperbolic mapping layer are initialized. A large language model is then invoked to construct a hierarchical semantic tree from the category names of the current task. The base task dataset and the incremental task dataset are input into the vision-language model for training. During training, the hyperbolic mapping layer maps the aggregated Euclidean-space features to hyperbolic space; based on the hierarchical semantic tree, cone geometric constraints anchor subclass features within the cone region of their parent class features, and the hyperbolic contrastive loss and the hierarchical entailment loss are calculated. When the model parameters are updated, the null space of the old task features is calculated, the gradient of the new task is projected onto this null space, and the model parameters are saved. The image data to be recognized is input into the trained vision-language model, and the predictions from Euclidean space and hyperbolic space are integrated to provide the prediction result to the user. Compared with existing technologies, the method effectively addresses the drift of fine-grained features in incremental learning by introducing hierarchical geometric constraints in hyperbolic space. At the same time, null-space gradient projection significantly reduces catastrophic forgetting without storing large amounts of private historical data. The method not only improves recognition accuracy in open scenes but also ensures the stability of model updates, making it suitable for the resource-constrained environment of smart terminals. This invention effectively solves the catastrophic forgetting problem caused by fine-grained feature drift in streaming learning, and improves the recognition accuracy and stability of the model in open and dynamic scenes by utilizing a hierarchical structure.

Attached Figure Description

[0018] The accompanying drawings, which form part of this application, are used to provide a further understanding of the application and to make other features, objects, and advantages of the application more apparent. The illustrative embodiments and descriptions of this application are used to explain the application and do not constitute an undue limitation of the application.

[0019] In the accompanying drawings:
Figure 1 is a flowchart of the main steps of the streaming image recognition method for smart terminals based on hierarchical semantic anchoring in an embodiment of the present invention;
Figure 2 is a flowchart of the dataset collection steps of the method in an embodiment of the present invention;
Figure 3 is a flowchart of the model training steps of the method in an embodiment of the present invention;
Figure 4 is a flowchart of the steps for feeding back the prediction result in the method in an embodiment of the present invention.

Detailed Implementation

[0020] Embodiments of this disclosure will now be described in more detail with reference to the accompanying drawings. While some embodiments of this disclosure are shown in the drawings, it should be understood that this disclosure can be implemented in various forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided to provide a more thorough and complete understanding of this disclosure. It should be understood that the accompanying drawings and embodiments of this disclosure are for illustrative purposes only and are not intended to limit the scope of protection of this disclosure.

[0021] It should also be noted that, for ease of description, only the parts relevant to the invention are shown in the accompanying drawings. Unless otherwise specified, the embodiments and features described in this disclosure can be combined with each other.

[0022] Example: This invention discloses a streaming image recognition method for smart terminals based on hierarchical semantic anchoring. It is applicable to smart terminal scenarios that require continuous processing and classification of new image data, where image categories are known and may continue to increase, especially where computing resources are limited and user privacy must be protected (by not storing historical images).

[0023] This disclosure will now be described in detail with reference to the accompanying drawings and embodiments. As shown in Figure 1, the present invention mainly includes the following steps:
Step 1: During operation of the smart terminal, continuously acquire streaming image data and the corresponding category labels, and preprocess them to obtain the base task dataset and the incremental task dataset;
Step 2: Load the pre-trained vision-language model, initialize the task-specific hierarchical perception module and the globally shared hyperbolic mapping layer, and invoke a large language model to construct a hierarchical semantic tree from the category names of the current task;
Step 3: Input the base task dataset and the incremental task dataset into the vision-language model for training. During training, the hyperbolic mapping layer maps the aggregated Euclidean-space features to hyperbolic space; based on the hierarchical semantic tree, cone geometric constraints anchor subclass features within the cone region of their parent class features, and the hyperbolic contrastive loss and the hierarchical entailment loss are calculated;
Step 4: When updating the model parameters, calculate the null space of the old task features, project the gradient of the new task onto this null space, and save the model parameters;
Step 5: Obtain the image data to be recognized, input it into the trained vision-language model, integrate the predictions from Euclidean space and hyperbolic space to obtain the prediction result, and feed it back to the user.

[0024] Specifically, smart terminals include edge devices with computing capabilities, such as smartphones and robots. Streaming image data refers to image data that the device receives in batches and whose set of categories grows over time. This invention is applicable to continual learning scenarios on mobile devices where resources are limited and historical data does not need to be stored. The following embodiment uses the automatic classification function of a smartphone photo album as an example. A smartphone already has basic recognition capabilities when it leaves the factory; this is the base task. As users use the device, they continually photograph new objects (such as specific pet breeds or newly purchased furniture), and this new data constitutes the streaming incremental tasks. The purpose of this invention is to enable the phone to learn new objects without forgetting the objects it could already recognize, and to understand hierarchical relationships such as that between "Corgi" and "dog".

[0025] As shown in Figure 2, in step 1, streaming image data and the corresponding category labels are continuously acquired during operation of the smart terminal and preprocessed to obtain the base task dataset and the incremental task dataset. Specifically, a certain amount of image data and the corresponding category labels are first collected on the smart terminal. The image data is preprocessed, and the preprocessed image data is used as the base task dataset. The preprocessing methods include resizing, normalizing, and augmenting the collected image data. Because the data is streaming, new image data keeps arriving from the open environment while the smart terminal is in operation. The newly arriving image data and the corresponding category labels are collected continuously to form the streaming image data. The collected streaming image data is then preprocessed and used as the incremental task dataset for subsequent continual updates of the model.
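The preprocessing operations named above (resizing, normalizing, augmenting) can be illustrated with a minimal NumPy sketch. The target size, the normalization statistics, and the choice of a random horizontal flip as the augmentation are assumptions for illustration; the text does not specify them.

```python
import numpy as np

# Hypothetical per-channel statistics; the source text does not give values.
MEAN = np.array([0.5, 0.5, 0.5], dtype=np.float32)
STD = np.array([0.5, 0.5, 0.5], dtype=np.float32)

def resize_nearest(img, size):
    """Nearest-neighbour resize of an H x W x C image to size x size."""
    h, w = img.shape[:2]
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return img[rows][:, cols]

def normalize(img):
    """Scale uint8 pixels to [0, 1], then standardize each channel."""
    x = img.astype(np.float32) / 255.0
    return (x - MEAN) / STD

def augment(img, rng):
    """Random horizontal flip, used here as one simple augmentation."""
    return img[:, ::-1] if rng.random() < 0.5 else img

def preprocess(img, size=224, rng=None, train=True):
    """Resize, optionally augment (training only), then normalize."""
    if rng is None:
        rng = np.random.default_rng(0)
    img = resize_nearest(img, size)
    if train:
        img = augment(img, rng)
    return normalize(img)
```

The same `preprocess` function would be applied to both the base task data and each incremental batch, with `train=False` at inference time so that no random augmentation is applied.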

[0026] As shown in Figure 3, in step 2, the pre-trained vision-language model is loaded, the task-specific hierarchical perception module and the globally shared hyperbolic mapping layer are initialized, and a large language model is invoked to construct a hierarchical semantic tree from the category names of the current task. Specifically, the pre-trained vision-language model is a deep network that includes both a visual encoder and a text encoder. First, a pre-trained vision-language model (the CLIP model) is loaded as the backbone network, and the task-specific hierarchical perception module and the globally shared hyperbolic mapping layer are initialized. Next, the interface of a large language model (such as the GPT series) is called with all category names of the current task as input, generating a hierarchical semantic tree structure of the form "entity-parent class-child class". For example, for the new category "corgi", the path "entity -> animal -> dog -> corgi" is generated. The hierarchical perception module uses lightweight linear layers or adapters to fine-tune the model for the current task. The hyperbolic mapping layer maps features from Euclidean space to a hyperbolic manifold, whose geometric properties are used to represent hierarchical structure. The hierarchical semantic tree is constructed by the large language model, which automatically infers the hierarchical relationships between categories from the input category labels and generates tree-structured data. The tree contains the root node "entity", the generated abstract parent class nodes, and the concrete child class nodes, which makes the parent-child hierarchy between categories explicit.
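A hierarchical semantic tree of this kind can be held in a simple child-to-parent map. The sketch below assumes the large language model returns paths in the "entity -> animal -> dog -> corgi" form quoted above; the class name `SemanticTree` and the hard-coded paths are hypothetical stand-ins, and no actual language-model call is made.

```python
from dataclasses import dataclass, field

@dataclass
class SemanticTree:
    """Hierarchical semantic tree stored as a child -> parent map."""
    root: str = "entity"
    parent: dict = field(default_factory=dict)

    def add_path(self, path):
        """Insert a path such as 'entity -> animal -> dog -> corgi'."""
        nodes = [n.strip() for n in path.split("->")]
        if nodes[0] != self.root:
            raise ValueError(f"path must start at root {self.root!r}")
        for par, child in zip(nodes, nodes[1:]):
            known = self.parent.setdefault(child, par)
            if known != par:
                raise ValueError(f"{child!r} already has parent {known!r}")

    def ancestors(self, node):
        """Ancestors of a node, from its direct parent up to the root."""
        chain = []
        while node in self.parent:
            node = self.parent[node]
            chain.append(node)
        return chain

    def parent_child_pairs(self):
        """(parent, child) pairs, e.g. for a cone constraint during training."""
        return sorted((p, c) for c, p in self.parent.items())

# Stand-in for paths that would be produced by a large language model.
tree = SemanticTree()
for p in ["entity -> animal -> dog -> corgi",
          "entity -> animal -> dog -> husky",
          "entity -> furniture -> sofa"]:
    tree.add_path(p)

print(tree.ancestors("corgi"))
```

When a new incremental task arrives, only the paths for its category names need to be added; existing parent nodes such as "dog" are reused rather than duplicated.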

[0027] In step 3, the base task dataset and the incremental task dataset are input into the vision-language model for training. During training, the hyperbolic mapping layer maps the aggregated Euclidean-space features to hyperbolic space; based on the hierarchical semantic tree, cone geometric constraints anchor subclass features within the cone region of their parent class features, and the hyperbolic contrastive loss and the hierarchical entailment loss are calculated. Specifically, the following steps are included:
Step 3.1: For the current task, freeze the hierarchical perception modules of the old tasks and use the optimizer to train only the hierarchical perception module of the current task;
Step 3.2: Input the images of the current task into the frozen visual encoder to obtain visual features, input the visual features into the hierarchical perception modules, and add the resulting features together to obtain the aggregated Euclidean-space features;
Step 3.3: Input the aggregated Euclidean-space features into the hyperbolic mapping layer, which maps them to hyperbolic space to yield hyperbolic embedding vectors;
Step 3.4: Based on the hierarchical semantic tree, constrain the hyperbolic embedding vector of each subclass node to lie within the cone region defined by its parent node, and calculate the hierarchical entailment loss in hyperbolic space;
Step 3.5: Simultaneously calculate the hyperbolic contrastive loss, which pulls matching image-text pairs closer together and pushes mismatched image-text pairs apart.

[0028] Specifically, the base task dataset and the incremental task dataset are input into the vision-language model for training. The images of the current task and the corresponding text descriptions are input, passed through the visual encoder and the hierarchical perception module, and then projected into hyperbolic space by the hyperbolic mapping layer. That is, for the task currently being trained (the base task data or any incremental task data currently being input), the hierarchical perception modules of all old tasks (the historical tasks trained before the current task) are frozen, and an optimizer such as AdamW is used to train only the hierarchical perception module of the current task. The images of the current task are input into the frozen visual encoder to obtain frozen visual features. The visual features are input into the accumulated hierarchical perception modules, and the resulting features are added together to obtain the aggregated Euclidean-space features. The aggregated Euclidean-space features are then input into the hyperbolic mapping layer, which maps them to hyperbolic space to obtain the hyperbolic embedding vectors.
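The forward path described above (frozen features, accumulated task modules, hyperbolic mapping) can be sketched in NumPy as follows. Several details are assumptions made for illustration: each hierarchical perception module is modeled as a single linear map, aggregation is taken to be the frozen feature plus the sum of module outputs, and the hyperbolic mapping layer is modeled as a linear projection followed by the exponential map at the origin of the Lorentz model with curvature parameter `c`.

```python
import numpy as np

def aggregate(frozen_feat, adapters):
    """Add the outputs of all task modules (assumed linear) to the frozen feature."""
    out = frozen_feat.copy()
    for W in adapters:
        out = out + frozen_feat @ W
    return out

def exp_map_origin(v, c=1.0):
    """Lift Euclidean vectors onto the Lorentz hyperboloid of curvature -c.

    Returns the space and time components; the time component follows from
    the hyperboloid constraint x_time = sqrt(1/c + ||x_space||^2).
    """
    norm = np.linalg.norm(v, axis=-1, keepdims=True)
    scaled = np.sqrt(c) * norm
    # sinh(s)/s tends to 1 as s -> 0, so guard the division for tiny norms.
    factor = np.where(scaled > 1e-8,
                      np.sinh(scaled) / np.maximum(scaled, 1e-8),
                      1.0)
    space = factor * v
    time = np.sqrt(1.0 / c + np.sum(space**2, axis=-1))
    return space, time

def hyperbolic_mapping(z, W_h, c=1.0):
    """Assumed form of the mapping layer: linear projection, then exponential map."""
    return exp_map_origin(z @ W_h, c)
```

In this sketch only the newest entry of `adapters` would receive gradient updates, while `W_h` is shared across tasks, which is why its updates are the ones that need protection against forgetting.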

[0029] Based on the hierarchical semantic tree, the hyperbolic embedding vector of each subclass node is constrained to lie within the cone region defined by its parent node, and the hierarchical entailment loss is calculated in hyperbolic space. The cone region refers to the angular range defined by the parent node in hyperbolic geometry. Requiring child feature nodes to reside within the cone region defined by the parent feature node in hyperbolic space achieves semantic anchoring: the hierarchical entailment loss constrains the subclass embedding vectors to this region, forcing the subclass semantics to be contained within the parent class semantics and thereby anchoring features and preventing drift. The loss is defined through a soft maximum function, which maps its input to a smooth, differentiable positive output, applied to two quantities: the cone half-aperture of the parent node, which depends on a constant and on the curvature, and the exterior angle by which the subclass node deviates from the cone boundary. The exterior angle is computed from the space and time components of the points in hyperbolic space (Lorentz model), the curvature, and the hyperbolic inner product. Simultaneously, the hyperbolic contrastive loss is calculated to pull matching image-text pairs closer together and push mismatched pairs apart. It is defined in terms of the image-to-text matching probability and the text-to-image matching probability, which are computed from the hyperbolic visual feature and the hyperbolic text feature of each positive image-text pair in the current batch, the hyperbolic distance, the total number of samples in the batch, and a temperature parameter that controls the smoothness of the distribution.
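The two losses can be sketched in NumPy as below. The formulas are an assumed Lorentz-model formulation that matches the quantities named in this paragraph (half-aperture, exterior angle, space and time components, hyperbolic distance, temperature); they are not taken from the application itself. The constant `k`, the use of softplus as the smooth positive function, and the equal weighting of the two contrastive directions are assumptions.

```python
import numpy as np

def time_component(xs, c):
    """Time component implied by the hyperboloid constraint."""
    return np.sqrt(1.0 / c + np.sum(xs**2, axis=-1))

def lorentz_inner(xs, xt, ys, yt):
    """Lorentzian inner product from space (xs, ys) and time (xt, yt) parts."""
    return np.sum(xs * ys, axis=-1) - xt * yt

def hyperbolic_distance(xs, ys, c=1.0):
    inner = lorentz_inner(xs, time_component(xs, c), ys, time_component(ys, c))
    return np.arccosh(np.clip(-c * inner, 1.0, None)) / np.sqrt(c)

def half_aperture(ps, c=1.0, k=0.1):
    """Cone half-aperture of a parent; the cone narrows away from the origin."""
    ratio = 2.0 * k / (np.sqrt(c) * np.linalg.norm(ps, axis=-1))
    return np.arcsin(np.clip(ratio, -1.0, 1.0))

def exterior_angle(ps, cs, c=1.0):
    """Angle at the parent between the outward radial direction and the child."""
    pt, ct = time_component(ps, c), time_component(cs, c)
    ci = c * lorentz_inner(ps, pt, cs, ct)
    num = ct + pt * ci
    den = np.linalg.norm(ps, axis=-1) * np.sqrt(np.clip(ci**2 - 1.0, 1e-12, None))
    return np.arccos(np.clip(num / den, -1.0, 1.0))

def entailment_loss(ps, cs, c=1.0, k=0.1):
    """Smooth penalty on how far each child lies outside its parent's cone."""
    gap = exterior_angle(ps, cs, c) - half_aperture(ps, c, k)
    return np.logaddexp(0.0, gap).mean()  # softplus

def log_softmax(z, axis):
    return z - np.logaddexp.reduce(z, axis=axis, keepdims=True)

def contrastive_loss(img, txt, c=1.0, tau=0.07):
    """Symmetric contrastive loss with negative hyperbolic distance as logit."""
    logits = -hyperbolic_distance(img[:, None, :], txt[None, :, :], c) / tau
    i2t = -np.mean(np.diag(log_softmax(logits, axis=1)))
    t2i = -np.mean(np.diag(log_softmax(logits, axis=0)))
    return 0.5 * (i2t + t2i)
```

All inputs are the space components of Lorentz-model points, with rows of `ps` and `cs` forming parent-child pairs from the semantic tree and rows of `img` and `txt` forming positive image-text pairs.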

[0030] In step 4, when the model parameters are updated, the null space of the old task features is calculated, the gradient of the new task is projected onto this null space, and the model parameters are saved for use in the next step. Specifically, this includes the following steps:
Step 4.1: Calculate the uncentered covariance matrix of the Euclidean features that are input to the hyperbolic mapping layer, and perform singular value decomposition on it;
Step 4.2: Select the smallest index at which the cumulative energy exceeds a set threshold, and take the resulting orthogonal complement matrix as the null space of the old task features;
Step 4.3: Project the original update gradient of the hyperbolic mapping layer parameters onto this null space, so that only the gradient components orthogonal to the variation of the old task features are retained for the parameter update.

[0031] Specifically, when the parameters of the hyperbolic mapping layer are updated, forgetting is prevented as follows. First, the uncentered covariance matrix of the Euclidean features input to the hyperbolic mapping layer is calculated; its definition uses the number of samples of task j. Singular value decomposition is then performed on this matrix, where U and V are orthogonal matrices and the middle factor is the diagonal matrix of singular values. The orthogonal complement matrix is obtained by selecting the smallest index at which the cumulative energy exceeds a set threshold, and serves as the null space of the old task features. The original update gradient of the hyperbolic mapping layer parameters is projected onto this null space to obtain the projected gradient. Gradient projection restricts parameter updates to directions orthogonal to the principal variations of the old task features, thus reducing interference with existing knowledge. Only the gradient components orthogonal to the variation of the old task features are used for the parameter update, which protects old task knowledge from being overwritten and ensures that parameter updates do not disrupt the mapping of old categories in hyperbolic space. For example, because the orthogonal complement matrix spans the directions in which the variance of the old features is close to zero, the old features have almost no component along these directions; an update along them therefore does not interfere with the existing outputs of the hyperbolic mapping layer for old tasks. After training is complete, the model parameters are saved for the initialization of the next stage.
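The null-space projection can be sketched in NumPy as follows. The function names, the row-wise feature layout, the single pooled covariance matrix, and the default threshold of 0.99 are assumptions for illustration; the text only states that the threshold is preset.

```python
import numpy as np

def null_space_basis(old_feats, energy=0.99):
    """Orthonormal basis of directions carrying little old-task feature energy.

    old_feats: (n, d) Euclidean features that entered the mapping layer.
    """
    cov = old_feats.T @ old_feats / old_feats.shape[0]  # uncentered covariance
    U, S, _ = np.linalg.svd(cov)                        # S sorted descending
    ratio = np.cumsum(S) / np.sum(S)
    # Smallest number of leading components whose cumulative energy
    # reaches the threshold; the remaining columns form the null space.
    k = int(np.searchsorted(ratio, energy)) + 1
    return U[:, k:]

def project_gradient(grad, basis):
    """Keep only the gradient component inside the approximate null space.

    grad: (m, d) gradient of a weight W applied as y = W @ x.
    """
    return grad @ basis @ basis.T

# Toy check: old features occupy only the first 3 of 6 dimensions.
rng = np.random.default_rng(0)
old = np.hstack([rng.normal(size=(50, 3)), np.zeros((50, 3))])
basis = null_space_basis(old, energy=1 - 1e-9)
grad = rng.normal(size=(4, 6))
grad_proj = project_gradient(grad, basis)
# The projected update leaves the outputs on old features unchanged.
print(np.abs(old @ grad_proj.T).max())
```

A lower threshold keeps more directions in the null space, which gives the new task more freedom at the cost of weaker protection for old tasks.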

[0032] Referring to Figure 4, for step 5, the image data to be recognized is input into the trained visual-language model, and the prediction results in Euclidean space and hyperbolic space are integrated to obtain the prediction result that is fed back to the user. Specifically, step 5 includes the following sub-steps: Step 5.1, obtain the image data to be classified, preprocess it, and input it into the trained and updated visual-language model; Step 5.2, calculate the similarity prediction probability based on cosine similarity in Euclidean space and the negative hyperbolic prediction probability based on negative hyperbolic distance in hyperbolic space; Step 5.3, integrate the similarity prediction probability and the negative hyperbolic prediction probability to obtain the prediction result; Step 5.4, select the category with the highest probability in the prediction result as the final classification result and feed it back to the user.

[0033] Specifically, the image to be classified is input into the model, and the predicted probability of each category is calculated in each of the two feature spaces. In the original Euclidean space, the cosine similarity between the image features and each category center is normalized exponentially, as in the steps described above, to obtain the similarity prediction probability. In hyperbolic space, the negative hyperbolic distance between the hierarchy-aware hyperbolic features of the image and the hyperbolic features of each text prototype is used as the logit to calculate the negative hyperbolic prediction probability. The similarity prediction probability and the negative hyperbolic prediction probability are then integrated to obtain the final prediction result. Finally, the category with the highest probability in the prediction result is selected as the final classification result and fed back to the user. For example, when a user takes a new photo, the image is first preprocessed and then input into the trained model. The model contains two inference paths: one calculates the cosine-similarity probability between image features and text features in the original Euclidean space, and the other calculates the negative-hyperbolic-distance probability between image embeddings and text embeddings in hyperbolic space. The two sets of probability values are combined by a weighted sum, and the category with the highest confidence is selected as the prediction result. Finally, the recognition result (such as "This is a Corgi") is displayed on the screen and fed back to the user.
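The two inference paths of step 5 can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the applicant's implementation: the text does not specify the hyperbolic model, so the unit Poincaré ball (curvature -1) is assumed here, and the temperature of 0.07, the fusion weight of 0.5, and all function names are hypothetical.

```python
import numpy as np


def softmax(logits):
    # Exponential normalization with the usual max-shift for stability.
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()


def cosine_probs(image_feat, class_centers, temperature=0.07):
    # Euclidean path: cosine similarity to every category center.
    a = image_feat / np.linalg.norm(image_feat)
    b = class_centers / np.linalg.norm(class_centers, axis=1, keepdims=True)
    return softmax(b @ a / temperature)


def poincare_distance(x, y):
    # Distance on the unit Poincare ball (assumed model); x: (d,), y: (c, d).
    sq = np.sum((x - y) ** 2, axis=-1)
    denom = (1.0 - np.sum(x ** 2, axis=-1)) * (1.0 - np.sum(y ** 2, axis=-1))
    return np.arccosh(1.0 + 2.0 * sq / denom)


def hyperbolic_probs(image_hyp, text_hyp, temperature=0.07):
    # Hyperbolic path: negative distance to each text prototype as the logit.
    return softmax(-poincare_distance(image_hyp, text_hyp) / temperature)


def predict(image_feat, class_centers, image_hyp, text_hyp, weight=0.5):
    # Weighted sum of the two probability vectors; weight is a hypothetical knob.
    p_euc = cosine_probs(image_feat, class_centers)
    p_hyp = hyperbolic_probs(image_hyp, text_hyp)
    fused = weight * p_euc + (1.0 - weight) * p_hyp
    return int(np.argmax(fused)), fused
```

Because both paths return normalized probabilities, a convex combination of them is again a probability vector, so the index returned by `predict` can be mapped directly to the category name shown to the user.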

[0034] To implement the above embodiments, this application also discloses an electronic device. The electronic device may include a processing unit (e.g., a central processing unit, a graphics processing unit, etc.) that can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) or a program loaded from a storage device into a random access memory (RAM). Various programs and data required for the operation of the electronic device are also stored in the RAM. The processing unit, ROM, and RAM are interconnected via a bus. An input/output (I/O) interface is also connected to the bus. Typically, the following devices can be connected to the I/O interface: input devices including, for example, a touchscreen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices including, for example, a liquid crystal display (LCD), speaker, vibrator, etc.; storage devices including, for example, magnetic tape, hard disk, etc.; and communication devices. The communication device allows the electronic device to communicate with other devices, wirelessly or by wire, to exchange data. Although an electronic device with various components is described, it should be understood that not all of the described components are required; more or fewer components may alternatively be implemented or provided.

[0035] In particular, according to some embodiments of this disclosure, the processes described above with reference to the flowcharts can be implemented as computer software programs. For example, some embodiments of this disclosure include a computer program product comprising a computer program carried on a computer storage medium, the computer program containing program code for performing the methods shown in the flowcharts. In such embodiments, the computer program can be downloaded and installed from a network via a communication device, or installed from a storage device, or installed from a ROM. When the computer program is executed by a processing device, it performs the functions defined above in the methods of some embodiments of this disclosure.

[0036] It should be noted that the computer storage medium in some embodiments of this disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination thereof. A computer-readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of a computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination thereof.

[0037] In some embodiments of this disclosure, a computer-readable storage medium can be any tangible medium that contains or stores a program that can be used by or in connection with an instruction execution system, apparatus, or device. In some embodiments of this disclosure, a computer-readable signal medium can include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such propagated data signals can take various forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination thereof. A computer-readable signal medium can also be any computer storage medium other than a computer-readable storage medium, which can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device. The program code contained on the computer storage medium can be transmitted using any suitable medium, including but not limited to: wires, optical fibers, RF (radio frequency), etc., or any suitable combination thereof.

[0038] In other implementations, clients and servers may communicate using any currently known or future-developed network protocol such as HTTP (Hypertext Transfer Protocol), and may be interconnected by digital data communication of any form or medium (e.g., a communication network). Examples of communication networks include local area networks ("LANs"), wide area networks ("WANs"), the Internet (e.g., the Internet of Things), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future-developed networks.

[0039] The aforementioned computer storage medium may be included in the aforementioned electronic device, or it may exist independently without being assembled into the electronic device. The aforementioned computer storage medium carries one or more programs which, when executed by the electronic device, enable the electronic device to implement the streaming image recognition method for intelligent terminals based on hierarchical semantic anchoring.

[0040] Computer program code for performing operations of some embodiments of this disclosure can be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code can be executed entirely on the user's computer, partially on the user's computer, as a standalone software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server. In cases involving remote computers, the remote computer can be connected to the user's computer via any type of network, including a local area network (LAN) or a wide area network (WAN), or can be connected to an external computer (e.g., via the Internet using an Internet service provider).

[0041] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of code containing one or more executable instructions for implementing a specified logical function. It should also be noted that in some alternative implementations, the functions indicated in the blocks may occur in a different order than those indicated in the drawings. Each block in the block diagrams and / or flowcharts, and combinations of blocks in the block diagrams and / or flowcharts, may be implemented using a dedicated hardware-based system that performs the specified function or operation, or using a combination of dedicated hardware and computer instructions. Units described in some embodiments of the present disclosure may be implemented in software or hardware. The described units may also be located in a processor, and the names of these units do not necessarily constitute a limitation on the unit itself.

[0042] The functions described above in this document can be performed, at least in part, by one or more hardware logic components. For example, exemplary types of hardware logic components that can be used include, without limitation: Field Programmable Gate Arrays (FPGAs), Application-Specific Integrated Circuits (ASICs), Application-Specific Standard Products (ASSPs), Systems-on-Chip (SoCs), Complex Programmable Logic Devices (CPLDs), and so on.

[0043] All technologies not described in detail in this invention are existing technologies. The above descriptions are merely some preferred embodiments of this disclosure and explanations of the technical principles employed. Those skilled in the art should understand that the scope of the invention involved in the embodiments of this disclosure is not limited to technical solutions formed by specific combinations of the above-described technical features, but should also cover other technical solutions formed by arbitrary combinations of the above-described technical features or their equivalent features without departing from the above-described inventive concept. For example, it also covers technical solutions formed by interchanging the above-described features with (but not limited to) technical features with similar functions disclosed in the embodiments of this disclosure.

Claims

1. An intelligent terminal streaming image recognition method based on hierarchical semantic anchoring, characterized in that it includes the following steps: Step 1, while the intelligent terminal is running, continuously acquire streaming image data and the corresponding category labels, and preprocess them to obtain a base task dataset and an incremental task dataset; Step 2, load a pre-trained visual-language model, initialize a task-specific hierarchical perception module and a globally shared hyperbolic mapping layer, and call a large language model to construct a hierarchical semantic tree from the category names of the current task; Step 3, input the base task dataset and the incremental task dataset into the visual-language model for training, during which the hyperbolic mapping layer maps the aggregated Euclidean space features to hyperbolic space, a cone geometric constraint based on the hierarchical semantic tree anchors the subclass features within the cone region of the parent class features, and the hyperbolic contrastive loss and the hierarchical entailment loss are calculated; Step 4, when updating the model parameters, calculate the null space of the old task features, project the gradient of the new task onto the null space, and save the model parameters; Step 5, obtain the image data to be recognized, input it into the trained visual-language model, and integrate the prediction results in Euclidean space and hyperbolic space to obtain the prediction result, which is fed back to the user.

2. The intelligent terminal streaming image recognition method based on hierarchical semantic anchoring according to claim 1, characterized in that: in step 1, a certain amount of image data and the corresponding category labels are first collected on the smart terminal, and the base task dataset is obtained after preprocessing; then, newly arriving image data and the corresponding category labels are continuously collected, and the incremental task dataset is obtained after preprocessing; the preprocessing includes operations such as resizing, normalizing, and enhancing the collected product image data.

3. The intelligent terminal streaming image recognition method based on hierarchical semantic anchoring according to claim 2, characterized in that step 3 further includes the following steps: Step 3.1, for the current task, freeze the hierarchical perception modules of old tasks and use the optimizer to train only the hierarchical perception module of the current task; Step 3.2, input the image of the current task into the frozen visual encoder to obtain visual features, input the visual features into the hierarchical perception module, and add the features together to obtain the aggregated Euclidean space features; Step 3.3, input the aggregated Euclidean space features into the hyperbolic mapping layer, which maps them to hyperbolic space to yield hyperbolic embedding vectors; Step 3.4, based on the hierarchical semantic tree, constrain the hyperbolic embedding vector of each subclass node to lie within the cone-shaped region defined by its parent class node, and calculate the hierarchical entailment loss in hyperbolic space; Step 3.5, simultaneously calculate the hyperbolic contrastive loss to pull matching image-text pairs closer together and push mismatched image-text pairs apart.

4. The intelligent terminal streaming image recognition method based on hierarchical semantic anchoring according to claim 3, characterized in that: in step 3, the hierarchical semantic tree is constructed by the large language model, which automatically infers the hierarchical relationships between categories from the input category labels and generates tree-structured data containing a root node, abstract parent class nodes, and subclass nodes; the cone-shaped region refers to the angular range defined by the parent class node in hyperbolic geometry; and the hierarchical entailment loss constrains the hyperbolic embedding vector of a subclass node to lie within the cone-shaped region defined by its parent class node, forcing the subclass semantics to be contained within the parent class semantics, thereby anchoring features and preventing drift.

5. The intelligent terminal streaming image recognition method based on hierarchical semantic anchoring according to claim 3, characterized in that step 4 further includes the following steps: Step 4.1, calculate the uncentered covariance matrix of the Euclidean features that are input to the hyperbolic mapping layer, and perform singular value decomposition on it; Step 4.2, select the smallest index at which the accumulated energy exceeds a set threshold, and obtain the orthogonal complement matrix, which serves as the null space of the old task features; Step 4.3, project the original update gradient of the hyperbolic mapping layer parameters onto the null space, so that only gradient components orthogonal to the changes in the old task features are retained for the parameter update.

6. The intelligent terminal streaming image recognition method based on hierarchical semantic anchoring according to claim 5, characterized in that step 5 further includes the following steps: Step 5.1, obtain the image data to be classified, preprocess it, and input it into the trained and updated visual-language model; Step 5.2, calculate the similarity prediction probability based on cosine similarity in Euclidean space and the negative hyperbolic prediction probability based on negative hyperbolic distance in hyperbolic space; Step 5.3, integrate the similarity prediction probability and the negative hyperbolic prediction probability to obtain the prediction result; Step 5.4, select the category with the highest probability in the prediction result as the final classification result and feed it back to the user.

7. The intelligent terminal streaming image recognition method based on hierarchical semantic anchoring according to claim 3, characterized in that: the hierarchical entailment loss in step 3.4 and the hyperbolic contrastive loss in step 3.5 are defined in terms of the following quantities: the half-aperture of the cone of the parent class node; the exterior angle of the subclass node relative to the boundary of the cone; the image-to-text matching probability; the text-to-image matching probability; the hyperbolic visual feature and the hyperbolic text feature of a positive image-text pair in the current batch; the hyperbolic distance; the total number of samples in the batch; and the temperature parameter used to control the smoothness of the distribution.
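For illustration only, and without limiting the claim, the two losses can be sketched from the quantities listed above. The sketch assumes the unit Poincaré ball as the hyperbolic model, a hinge on the difference between the exterior angle and the half-aperture for the entailment loss, and an equally weighted average of the image-to-text and text-to-image terms for the contrastive loss; none of these choices, nor the function names and the temperature of 0.1, is confirmed by the text.

```python
import numpy as np


def log_softmax(logits, axis):
    # Numerically stable log of the exponential normalization.
    z = logits - logits.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))


def pairwise_poincare_distance(x, y):
    # x: (n, d), y: (n, d) -> (n, n) distances on the unit Poincare ball (assumed).
    sq = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    denom = (1.0 - np.sum(x ** 2, axis=-1))[:, None] * (1.0 - np.sum(y ** 2, axis=-1))[None, :]
    return np.arccosh(1.0 + 2.0 * sq / denom)


def hyperbolic_contrastive_loss(img_hyp, txt_hyp, temperature=0.1):
    # Row i of img_hyp and row i of txt_hyp form a positive image-text pair.
    logits = -pairwise_poincare_distance(img_hyp, txt_hyp) / temperature
    idx = np.arange(len(img_hyp))
    image_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    text_to_image = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (image_to_text + text_to_image)


def entailment_loss(exterior_angle, half_aperture):
    # Zero when the subclass lies inside the parent cone, linear penalty outside.
    return np.maximum(0.0, exterior_angle - half_aperture).mean()
```

In this sketch the exterior angles and half-apertures are taken as precomputed inputs, because their exact expressions depend on the hyperbolic model, which the text does not specify.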

8. The intelligent terminal streaming image recognition method based on hierarchical semantic anchoring according to claim 6, characterized in that: the similarity prediction probability and the negative hyperbolic prediction probability in step 5.2 are defined in terms of the following quantities: the image features; the category centers; the negative hyperbolic distance; the hyperbolic features of the image; and the hyperbolic features of the text prototypes.

9. An electronic device, characterized in that it includes: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method according to any one of claims 1 to 8.

10. A computer storage medium storing a computer program, characterized in that: when the computer program is executed by a processor, it performs the steps of the method according to any one of claims 1 to 8.