An edge device picture classification method based on a vision-language model
By dividing the task set on edge devices and using external semantic description libraries and multimodal models for lightweight training, the problems of resource constraints and knowledge forgetting in streaming image recognition on mobile devices are solved, achieving efficient recognition and updating of new categories and improving recognition accuracy and stability.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- NANJING UNIV
- Filing Date
- 2026-04-01
- Publication Date
- 2026-06-26
AI Technical Summary
Mobile devices face challenges in streaming image recognition, including resource constraints, significant forgetting, insufficient semantic information, and high update costs, making it difficult to effectively identify and update new categories in open environments.
By continuously acquiring labeled streaming image data on edge devices, preprocessing it, and dividing it into an initial task set and an incremental task set, lightweight adaptation and fine-tuning training are performed using an external semantic description library and a pre-trained multimodal representation model. Combined with two-stage inference, continuous adaptation and recognition of new categories can be achieved.
Without relying on the replay of historical original images, it reduces the computational power required for updates, improves the ability to learn new image data and categories, enhances recognition accuracy and stability, and solves the problems of knowledge forgetting and insufficient semantic expression on mobile devices in incremental scenarios.
Figure CN122289784A_ABST
Description
Technical Field
[0001] This invention relates to streaming data processing and image recognition technology, and more particularly to an image recognition method for mobile devices or edge nodes that uses multimodal model representations and an external semantic enhancement mechanism. It belongs to the fields of computer vision and continual learning.
Background Technology
[0002] With the widespread adoption of mobile devices and changes in how content is produced, smartphones, tablets, and other mobile devices have become the main means by which people record their lives and obtain information. Actions such as casually photographing scenery, scanning QR codes for payment, and uploading product images are ubiquitous and generate a continuous influx of image data, forming typical streaming data. First, this streaming data arrives continuously. Every second, mobile devices worldwide generate massive numbers of images, from users sharing moments on social media platforms, to merchants uploading product images on e-commerce platforms, to real-time footage captured by surveillance equipment. This constant generation of data keeps mobile recognition systems permanently on standby and requires real-time processing capability. Second, the set of categories keeps expanding. In the past, mobile recognition mainly focused on common categories such as people, animals, and plants. Now niche handicrafts, specialized industrial parts, and rare plant and animal species have all become recognition targets, placing very high demands on the model's knowledge reserves and learning capabilities. Third, the data distribution drifts. The same type of object can look very different in different environments. For example, the color, brightness, and clarity of the same product photographed under indoor lighting, outdoor sunlight, and dim nighttime lighting vary greatly. Similarly, photos of the same person taken with different makeup, clothing, and angles can differ drastically. This dynamic change in data distribution makes it difficult for the model to maintain stable recognition criteria.
[0003] However, mobile devices inherently have many limitations. In terms of computing power, mobile processors cannot compare to cloud servers, and complex recognition algorithms lag noticeably when run on mobile devices. Regarding memory, storing large amounts of image data and model parameters quickly depletes the device's limited memory. Power consumption is also a significant concern: continuous recognition computation rapidly drains the battery and degrades the normal user experience. Network conditions are another major constraint: in remote areas with weak signals or in underground locations, transmitting data back to the cloud is extremely difficult. Traditional high-precision recognition methods often rely on large-scale offline training or frequent cloud transmission, resulting in high inference latency, high resource consumption, and high privacy and bandwidth costs. In recent years, multimodal or visual pre-trained models have shown outstanding performance in general visual recognition and good transfer capability. However, when a recognition system must continuously receive new categories and perform incremental updates in open environments, problems such as limited update capacity on mobile devices, insufficient semantic expression, and the inability to update incrementally persist.
[0004] Therefore, there is an urgent need for an image recognition solution that can operate under the resource constraints of mobile devices, does not require training from scratch, reduces forgetting in incremental scenarios, and also preserves generalization ability.
Summary of the Invention
[0005] The summary section of this application is intended to provide a brief overview of the concepts, which will be described in detail in the detailed description section below. This summary section is not intended to identify key or essential features of the claimed technical solutions, nor is it intended to limit the scope of the claimed technical solutions.
[0006] To address the problems of significant forgetting, insufficient semantic information, and high update costs in existing mobile streaming image recognition technologies, this invention proposes a mobile device image recognition method for streaming data. The method includes streaming data collection, external semantic construction, incremental training, and inference calibration. The invention achieves continuous adaptation to new categories without storing historical original image samples, or while storing only compressed statistical information, and improves recognition stability and accuracy through external semantic enhancement and inference calibration. This solves the problems mentioned in the background section.
[0007] To achieve the above objectives, the present invention provides the following technical solution. As a first aspect of this application, the present invention discloses an edge device image classification method based on a vision-language model, comprising the following steps:
Step 1: Continuously acquire labeled streaming image data on the edge device, perform unified preprocessing, and divide the preprocessed data into an initial task set and subsequent incremental task sets;
Step 2: Based on the initial task set and the incremental task sets, semantically expand and filter the class name text to obtain an external semantic description library that can be used for training and inference;
Step 3: Load the pre-trained multimodal representation model and initialize it with publicly available pre-trained parameters;
Step 4: First train lightweight adaptation parameters on the initial task set to establish a category classification head or category representation structure, then fine-tune and update the online model based on the incremental task sets;
Step 5: Obtain the image to be classified, calculate its image embedding features, and obtain the final predicted probability through two-stage inference based on the accumulated task-level adaptation parameters;
[0008] Step 6: Based on the final predicted probability, output the final classification result and feed it back to the user.
[0009] Preferably, in step 1, image data provided by the user and the corresponding category labels are first collected on the edge device to form initial samples, which are then preprocessed to obtain the initial task set. Then, in open scenes, image data of newly emerging categories and the corresponding category labels are continuously collected to form incremental samples, and the preprocessed incremental samples serve as the incremental task set. The preprocessing includes image resizing, pixel normalization, and one or more image augmentation operations such as random cropping, color perturbation, and geometric transformation.
[0010] Preferably, step 2 further includes the following steps:
Step 2.1: Load the set of class name texts or category phrases corresponding to the category labels in the initial task set and the incremental task sets;
Step 2.2: Semantically expand the class name text based on external knowledge sources to obtain a semantic description set for each category;
Step 2.3: Filter and normalize the semantic description set to form a semantic description library that can be used for matching or training, retaining at least one description for each category.
[0011] Preferably, step 4 further includes the following steps:
Step 4.1: Train lightweight adaptation parameters on the initial task set to obtain the initial-stage parameter set, and establish the category classification head or category prototype representation available in the initial stage;
Step 4.2: For each incremental task in the incremental task sets, initialize the corresponding task-level semantic mapping unit and view-consistency mapping unit to form the task-level adaptation parameters, and freeze the adaptation parameters of all historical tasks;
[0012] Step 4.3: Fine-tune the multimodal representation model on the current incremental task set according to the training objective;
Step 4.4: Construct compressed replay information for historical categories and combine it with the current task data in the training objective of step 4.3 to suppress forgetting;
Step 4.5: Output and save the accumulated task-level adaptation parameters for the current task, and update the category classification head or category representation structure used for inference.
[0013] Preferably, step 5 further includes the following steps:
Step 5.1: Preprocess the image to be classified, input it into the image encoding branch, and calculate the basic image embedding features;
Step 5.2: Based on the accumulated task-level adaptation parameters, match the image embedding features with the text embedding features of each category to obtain the initial predicted probability;
Step 5.3: Select the top K candidate categories according to the initial predicted probability to form the candidate set;
Step 5.4: For the categories in the candidate set, call external knowledge sources to generate or retrieve semantic descriptions that discriminate between categories;
Step 5.5: Calculate the calibration score of each candidate category based on the discriminative semantic descriptions;
Step 5.6: Fuse the initial predicted probability with the calibration score to obtain the final predicted probability.
[0014] Preferably, in step 4.3, the training objective includes at least one or a combination of the following: using the semantic mapping unit to map the text features of the template class name text so that, in the embedding space, they remain similar to the description features of the corresponding category in the semantic description library; using the view-consistency mapping unit to apply consistency constraints to the features of different augmented views of the same image; and calculating a cross-modal matching loss based on the mapped image features and mapped text features to complete classification learning for the current task.
[0015] Preferably, in step 5.2, the initial predicted probability that the image to be classified $x$ belongs to the $k$-th category is defined as: $p_k(x) = \frac{\exp(\cos(v(x), t_k)/\tau)}{\sum_{j=1}^{C} \exp(\cos(v(x), t_j)/\tau)}$; where $v(x)$ denotes the visual embedding feature of the image to be classified after processing by the image encoding branch and the accumulated task-level view-consistency mapping units, $t_k$ denotes the text embedding feature of the $k$-th class text after processing by the text encoding branch and the accumulated task-level semantic mapping units, $\cos(\cdot,\cdot)$ denotes the cosine similarity function, $\tau$ denotes the temperature coefficient, and $C$ denotes the total number of categories seen up to the current stage.
[0016] Preferably, in step 5.5, the calibration score of each candidate category is defined in terms of the following quantities: the image to be classified; the number of categories in the candidate set; the discriminative semantic description that distinguishes the candidate category from another category in the candidate set; and the matching score between the image to be classified and that discriminative semantic description.
[0017] As a second aspect of this application, the present invention also discloses an electronic device, comprising: At least one processor, and a memory communicatively connected to said at least one processor; The memory stores instructions that can be executed by the at least one processor, which, when executed by the at least one processor, enables the at least one processor to perform the steps described above in the edge device image classification method based on a vision-language model.
[0018] As a third aspect of this application, the present invention also discloses a computer storage medium having a computer program stored thereon, characterized in that the computer program, when executed by a processor, implements the steps described above in the edge device image classification method based on a vision-language model.
[0019] Compared with the prior art, the beneficial effects of the present invention are as follows. This invention discloses an image classification method for edge devices based on a vision-language model. First, labeled streaming image data is continuously acquired on edge devices and preprocessed uniformly. The preprocessed data is then divided into an initial task set and subsequent incremental task sets. Based on the initial and incremental task sets, the class name text is semantically expanded, filtered, and normalized to obtain an external semantic description library that can be used for training and inference. A pre-trained multimodal representation model is loaded and initialized with publicly available pre-trained parameters. Lightweight adaptation parameters are first trained on the initial task set to establish a category classification head or category representation structure. Then the online model is fine-tuned and updated based on the incremental task sets. The image to be classified is acquired and its image embedding features are calculated. Based on the accumulated task-level adaptation parameters, two-stage inference is performed to obtain the final predicted probability. The final classification result is output based on the final predicted probability and fed back to the user. Compared with existing technologies, the image classification method based on a vision-language model provided by this invention reduces the computational power required for updates and improves the ability to learn new image data and categories without relying on the replay of historical original images. It addresses the difficulty existing methods have in fully using linguistic contextual information during incremental learning and their tendency toward knowledge degradation during continual learning. While maintaining low computational and storage overhead, it improves the model's recognition accuracy, stability, and generalization ability in environments where the set of categories changes dynamically.
Attached Figure Description
[0020] The accompanying drawings, which form part of this application, are used to provide a further understanding of the application and to make other features, objects, and advantages of the application more apparent. The illustrative embodiments and descriptions of this application are used to explain the application and do not constitute an undue limitation of the application.
[0021] In the accompanying drawings:
Figure 1 is a flowchart of the main steps of the edge device image classification method based on a vision-language model in an embodiment of the present invention.
Figure 2 is a flowchart of the steps for obtaining the initial task set and the incremental task sets in the method of this embodiment.
Figure 3 is a flowchart of the normalization steps in the method of this embodiment.
Figure 4 is a flowchart of the model training steps in the method of this embodiment.
Figure 5 is a flowchart of the final probability prediction steps in the method of this embodiment.
Detailed Implementation
[0022] Embodiments of this disclosure will now be described in more detail with reference to the accompanying drawings. While some embodiments of this disclosure are shown in the drawings, it should be understood that this disclosure can be implemented in various forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided to provide a more thorough and complete understanding of this disclosure. It should be understood that the accompanying drawings and embodiments of this disclosure are for illustrative purposes only and are not intended to limit the scope of protection of this disclosure.
[0023] It should also be noted that, for ease of description, only the parts relevant to the invention are shown in the accompanying drawings. Unless otherwise specified, the embodiments and features described in this disclosure can be combined with each other.
[0024] Example
This invention discloses an edge device image classification method based on a vision-language model. The disclosure is described in detail below with reference to the accompanying drawings and embodiments. As shown in Figure 1, the present invention mainly includes the following steps:
Step 1: Continuously acquire labeled streaming image data on the edge device, perform unified preprocessing, and divide the preprocessed data into an initial task set and subsequent incremental task sets;
Step 2: Based on the initial task set and the incremental task sets, semantically expand and filter the class name text to obtain an external semantic description library that can be used for training and inference;
Step 3: Load the pre-trained multimodal representation model and initialize it with publicly available pre-trained parameters;
Step 4: First train lightweight adaptation parameters on the initial task set to build a category classification head or category representation structure, then fine-tune and update the online model based on the incremental task sets;
Step 5: Obtain the image to be classified, calculate its image embedding features, and obtain the final predicted probability through two-stage inference based on the accumulated task-level adaptation parameters;
[0025] Step 6: Based on the final predicted probability, output the final classification result and feed it back to the user.
[0026] Specifically, as shown in Figure 2, step 1 continuously acquires labeled streaming image data on edge devices, preprocesses it uniformly, and divides the preprocessed data into an initial task set and subsequent incremental task sets. First, image data provided by users is collected on edge devices to form initial samples, which also include the category labels corresponding to the image data. Edge devices are hardware devices located at the edge of the network, close to the data source, including smartphones, tablets, smart cameras, webcams, wearable devices, and the like. Specifically, image data directly provided or generated by users, such as photographs or recorded facial images, is collected on edge devices such as smartphones or smart cameras. As these images are collected, each image is assigned a corresponding category label through actions such as manual naming, QR code scanning, or scene confirmation. Data augmentation is applied to the collected initial samples as preprocessing, and a classification template such as "a class of [CLASS]" is generated for each image category. The processed image data forms the initial task set. Next, image data of newly emerging categories and the corresponding category labels are continuously collected in open scenes to form incremental samples. The image data in the incremental samples is then preprocessed, and the preprocessed data serves as the incremental task set. Preprocessing of the initial and incremental samples includes image resizing, pixel normalization, and one or more image augmentation operations such as random cropping, color perturbation, and geometric transformation.
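The division of the labeled stream into an initial task set and incremental task sets can be illustrated with a minimal sketch. The source does not state the splitting rule, so the rule used here is an assumption: categories are assigned to tasks in order of first appearance, with `initial_classes` categories in the initial task and `classes_per_task` categories in each incremental task. The function name and both parameters are hypothetical.

```python
from typing import Dict, Hashable, Iterable, List, Tuple

Sample = Tuple[str, Hashable]  # (image identifier, category label)


def split_into_tasks(
    stream: Iterable[Sample],
    initial_classes: int,
    classes_per_task: int,
) -> List[Dict[Hashable, List[str]]]:
    """Group a labeled stream into an initial task (index 0) and incremental tasks.

    Assumed rule (not from the source): the first `initial_classes` distinct
    labels form task 0; each further group of `classes_per_task` newly seen
    labels forms one incremental task.
    """
    class_to_task: Dict[Hashable, int] = {}
    tasks: List[Dict[Hashable, List[str]]] = []
    for image_id, label in stream:
        if label not in class_to_task:
            seen = len(class_to_task)
            if seen < initial_classes:
                class_to_task[label] = 0
            else:
                class_to_task[label] = 1 + (seen - initial_classes) // classes_per_task
        task_idx = class_to_task[label]
        # Create empty task containers up to the required index.
        while len(tasks) <= task_idx:
            tasks.append({})
        tasks[task_idx].setdefault(label, []).append(image_id)
    return tasks
```

Later samples of an already known category are routed to the task in which that category first appeared, so each category belongs to exactly one task.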
[0027] Step 2, in which the class name text is semantically expanded and filtered based on the initial task set and the incremental task sets to obtain an external semantic description library that can be used for training and inference, further includes the following steps:
Step 2.1: Load the set of class name texts or category phrases corresponding to the category labels in the initial task set and the incremental task sets;
Step 2.2: Semantically expand the class name text based on external knowledge sources to obtain a semantic description set for each category;
Step 2.3: Filter and normalize the semantic description set to form a semantic description library that can be used for matching or training, retaining at least one description for each category.
[0028] Specifically, as shown in Figure 3, the set of class name texts or category phrases corresponding to the category labels in the initial task set and the incremental task sets is loaded. Based on external knowledge sources, the class name texts are semantically expanded to obtain a semantic description set for each category. The external knowledge source is a category attribute knowledge source built offline, including at least one or a combination of a pre-built attribute dictionary, a web-retrieval summary library, or a text generation model. The semantic description set is filtered and normalized to form a semantic description library that can be used for matching or training. Filtering includes at least one of the following rules: deleting descriptions irrelevant to the visual recognition of the current category, deleting duplicate or semantically similar descriptions, and deleting conflicting descriptions. Normalization includes at least one of the following processes: standardizing capitalization, punctuation, and sentence format; normalizing synonyms; deleting stop words; splitting long sentences into attribute phrases; and structuring descriptions according to preset attribute types. After filtering and normalization, at least one normalized description is retained for each category.
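The filtering and normalization of the semantic description set can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented procedure: it implements only two of the listed filtering rules (visually irrelevant and duplicate descriptions) and three of the normalization processes (capitalization, punctuation, stop words). The stop-word list, the `visual_terms` vocabulary used as a relevance test, and the fallback to the class template are all hypothetical choices.

```python
import string
from typing import Dict, List, Set

# Illustrative stop-word list (assumption); a real deployment would use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "with", "and", "it", "has"}


def normalize(description: str) -> str:
    """Lowercase, strip punctuation, drop stop words, and collapse whitespace."""
    text = description.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(tok for tok in text.split() if tok not in STOP_WORDS)


def build_library(
    raw: Dict[str, List[str]],
    visual_terms: Set[str],
    fallback_template: str = "a class of {name}",
) -> Dict[str, List[str]]:
    """Filter and normalize per-category descriptions, keeping at least one each."""
    library: Dict[str, List[str]] = {}
    for name, descriptions in raw.items():
        kept: List[str] = []
        for desc in descriptions:
            norm = normalize(desc)
            # Filtering rule: drop descriptions with no visually relevant term.
            if not any(tok in visual_terms for tok in norm.split()):
                continue
            # Filtering rule: drop exact duplicates after normalization.
            if norm in kept:
                continue
            kept.append(norm)
        # Guarantee at least one description per category (fallback is an assumption).
        if not kept:
            kept.append(fallback_template.format(name=name))
        library[name] = kept
    return library
```

Detecting semantically similar or conflicting descriptions would need an embedding model or a rule base, which is outside the scope of this sketch.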
[0029] As shown in Figure 4, step 3 loads the pre-trained multimodal representation model and initializes it with publicly available pre-trained parameters. Specifically, the multimodal representation model is a vision-language model pre-trained on large-scale image-text pair data via cross-modal contrastive learning, comprising at least an image encoding branch and a text encoding branch. The publicly available pre-trained parameters are the parameter set of the vision-language model obtained by pre-training on large-scale image-text pair data before this invention is executed. First, system resources are requested to load the pre-trained multimodal representation model. Then the multimodal representation model is initialized with the publicly available pre-trained parameters, during which the backbone parameters of the image encoding branch and the text encoding branch must be frozen.
[0030] Step 4, in which lightweight adaptation parameters are first trained on the initial task set to build a category classification head or category representation structure and the online model is then fine-tuned and updated based on the incremental task sets, includes the following steps:
Step 4.1: Train lightweight adaptation parameters on the initial task set to obtain the initial-stage parameter set, and establish the category classification head or category prototype representation available in the initial stage;
Step 4.2: For each incremental task in the incremental task sets, initialize the corresponding task-level semantic mapping unit and view-consistency mapping unit to form the task-level adaptation parameters, and freeze the adaptation parameters of all historical tasks;
[0031] Step 4.3: Fine-tune the multimodal representation model on the current incremental task set according to the training objective;
Step 4.4: Construct compressed replay information for historical categories and combine it with the current task data in the training objective of step 4.3 to suppress forgetting;
Step 4.5: Output and save the accumulated task-level adaptation parameters for the current task, and update the category classification head or category representation structure used for inference.
[0032] Specifically, lightweight adaptation parameters are trained on the initial task set to obtain the initial-stage parameter set and to establish the initial category classification head or category prototype representation. Because the backbone parameters of the image encoding and text encoding branches are frozen during training, only the additional lightweight adaptation parameters are updated. Image samples from the initial task set are input into the image encoding branch to obtain image features, while the template text or semantic descriptions of the corresponding categories are input into the text encoding branch to obtain text features. A cross-modal matching loss is calculated from the adapted image and text features and used for optimization. After training, a category classification head or category prototype representation is built from the embedding features of the samples of each category. For each incremental task in the incremental task sets, the corresponding task-level semantic mapping unit and view-consistency mapping unit are initialized to form the adaptation parameters for that task, and the adaptation parameters of all historical tasks are frozen. The task-level semantic mapping unit is a lightweight mapping module attached to the text encoding branch, used to map the template class name text features into a representation space consistent with the corresponding semantic description features. The view-consistency mapping unit is a lightweight mapping module attached to the image encoding branch, used to map the original image features into a representation space consistent with the augmented view features.
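The bookkeeping of task-level mapping units, where each new task receives a fresh unit and all historical units are frozen, can be sketched as a small data structure. The source does not specify the internal form of a mapping unit or how accumulated units are combined, so both are assumptions here: each unit is an elementwise scale-and-shift, and the outputs of all units are added as residuals to the frozen encoder feature. The names `MappingUnit` and `AdapterBank` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

Vector = List[float]


@dataclass
class MappingUnit:
    """Elementwise scale-and-shift; a stand-in for a lightweight mapping module."""
    scale: Vector
    shift: Vector
    trainable: bool = True

    def delta(self, feature: Vector) -> Vector:
        return [s * f + b for s, f, b in zip(self.scale, feature, self.shift)]


@dataclass
class AdapterBank:
    """Accumulated task-level units attached to one encoding branch."""
    dim: int
    units: List[MappingUnit] = field(default_factory=list)

    def start_task(self) -> MappingUnit:
        """Freeze all historical units, then append a zero-initialized unit."""
        for unit in self.units:
            unit.trainable = False
        new_unit = MappingUnit([0.0] * self.dim, [0.0] * self.dim)
        self.units.append(new_unit)
        return new_unit

    def apply(self, feature: Vector) -> Vector:
        """Add each unit's residual to the frozen-encoder feature (assumed rule)."""
        out = list(feature)
        for unit in self.units:
            out = [o + d for o, d in zip(out, unit.delta(feature))]
        return out
```

Under these assumptions, one bank would be kept for the text encoding branch (semantic mapping units) and one for the image encoding branch (view-consistency mapping units). Zero initialization means a new task starts from the behavior learned so far.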
[0033] Then the multimodal representation model is fine-tuned on the current incremental task set according to the training objective. The training objective includes at least one or a combination of the following: using the semantic mapping unit to map the text features of the template class name text so that, in the embedding space, they remain similar to the description features of the corresponding category in the semantic description library; using the view-consistency mapping unit to apply consistency constraints to the features of different augmented views of the same image; and calculating a cross-modal matching loss based on the mapped image features and mapped text features to complete classification learning for the current task.
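The three training objectives can be written as loss terms in a few lines. This is a sketch under assumptions rather than the exact objective of the invention: the semantic and view-consistency terms are taken as one minus cosine similarity, the cross-modal matching term as softmax cross-entropy over temperature-scaled cosine similarities, and the terms are combined by a weighted sum. The weights `w_sem` and `w_view` and the default temperature `tau=0.07` are hypothetical values, and all feature vectors are assumed to be non-zero.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def semantic_alignment_loss(template_feat: Vector, description_feat: Vector) -> float:
    """Pull the mapped template text feature toward the description feature."""
    return 1.0 - cosine(template_feat, description_feat)


def view_consistency_loss(view_a: Vector, view_b: Vector) -> float:
    """Pull features of two augmented views of the same image together."""
    return 1.0 - cosine(view_a, view_b)


def cross_modal_matching_loss(
    image_feat: Vector, text_feats: List[Vector], label: int, tau: float = 0.07
) -> float:
    """Cross-entropy of the true label under softmax(cos / tau)."""
    logits = [cosine(image_feat, t) / tau for t in text_feats]
    m = max(logits)  # subtract the maximum for numerical stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[label]


def total_loss(template_feat, description_feat, view_a, view_b,
               image_feat, text_feats, label,
               w_sem: float = 1.0, w_view: float = 1.0, tau: float = 0.07) -> float:
    """Weighted sum of the three terms (weighting scheme is an assumption)."""
    return (cross_modal_matching_loss(image_feat, text_feats, label, tau)
            + w_sem * semantic_alignment_loss(template_feat, description_feat)
            + w_view * view_consistency_loss(view_a, view_b))
```

In an actual training loop these terms would be computed on tensors by an automatic differentiation framework so that gradients reach only the trainable mapping units.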
[0034] The cross-modal matching loss is constructed from the cosine similarity between the mapped image features and the mapped text features: the classification probability is calculated from the similarities, and the cross-entropy loss is computed against the true labels. Next, compressed replay information for historical categories is constructed and combined with the current task data in the training objective of step 4.3 to suppress forgetting. The compressed replay information consists of the statistical feature vectors of historical categories in the embedding space, or perturbed forms of those vectors. Finally, the task-level adaptation parameters for the current task are output and saved, and the category classification head or category representation structure used for inference is updated.
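Compressed replay from statistical feature vectors can be sketched with the standard library alone. The source states only that the replay information consists of statistical feature vectors of historical categories or their perturbed form, so the specific choices below are assumptions: the statistics are the per-dimension mean and population standard deviation, and the perturbed form is Gaussian noise around the mean. The function names and `noise_scale` are hypothetical.

```python
import random
import statistics
from typing import Dict, Hashable, List, Tuple

Vector = List[float]
Stats = Tuple[Vector, Vector]  # (per-dimension mean, per-dimension std)


def compress(features: List[Vector]) -> Stats:
    """Summarize one category's embedding features; raw images are not kept."""
    dims = list(zip(*features))
    mean = [statistics.fmean(d) for d in dims]
    std = [statistics.pstdev(d) for d in dims]
    return mean, std


def replay(
    stats: Dict[Hashable, Stats],
    n_per_class: int,
    rng: random.Random,
    noise_scale: float = 1.0,
) -> List[Tuple[Vector, Hashable]]:
    """Draw perturbed pseudo-features for every historical category."""
    samples: List[Tuple[Vector, Hashable]] = []
    for label, (mean, std) in stats.items():
        for _ in range(n_per_class):
            feat = [rng.gauss(m, noise_scale * s) for m, s in zip(mean, std)]
            samples.append((feat, label))
    return samples
```

The pseudo-features returned by `replay` would be mixed with the current task's features when computing the training objective, so that old categories stay represented without storing any historical image.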
[0035] As shown in Figure 5, step 5, which acquires the image to be classified, calculates its image embedding features, and obtains the final predicted probability through two-stage inference based on the accumulated task-level adaptation parameters, further includes the following steps:
Step 5.1: Preprocess the image to be classified, input it into the image encoding branch, and calculate the basic image embedding features;
Step 5.2: Based on the accumulated task-level adaptation parameters, match the image embedding features with the text embedding features of each category to obtain the initial predicted probability;
Step 5.3: Select the top K candidate categories according to the initial predicted probability to form the candidate set;
Step 5.4: For the categories in the candidate set, call external knowledge sources to generate or retrieve semantic descriptions that discriminate between categories;
Step 5.5: Calculate the calibration score of each candidate category based on the discriminative semantic descriptions;
Step 5.6: Fuse the initial predicted probability with the calibration score to obtain the final predicted probability.
[0036] Specifically, the image to be classified $x$ is preprocessed and input into the image encoding branch to calculate the image embedding feature $z = f_{\mathrm{img}}(x)$, where $f_{\mathrm{img}}$ denotes the frozen image encoding branch. Then, based on the accumulated task-level adaptation parameters, the image embedding features are matched with the text embedding features of each category to obtain the initial predicted probability. The initial predicted probability that the image to be classified $x$ belongs to the $k$-th category is defined as: $p_k(x) = \frac{\exp(\cos(v(x), t_k)/\tau)}{\sum_{j=1}^{C} \exp(\cos(v(x), t_j)/\tau)}$; where $v(x)$ denotes the visual embedding feature of the image to be classified after processing by the image encoding branch and the accumulated task-level view-consistency mapping units, $t_k$ denotes the text embedding feature of the $k$-th class text after processing by the text encoding branch and the accumulated task-level semantic mapping units, $\cos(\cdot,\cdot)$ denotes the cosine similarity function, $\tau$ denotes the temperature coefficient, and $C$ denotes the total number of categories seen up to the current stage. The probabilities of all categories together constitute the initial predicted probability.
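The first inference stage, a temperature-scaled softmax over cosine similarities between the visual embedding and each category's text embedding, can be sketched as follows. The function names and the default temperature `tau=0.07` are assumptions for illustration, and the sketch divides the similarity by the temperature, which is one common convention. Inputs are assumed to be non-zero vectors that have already passed through the encoders and mapping units.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def initial_probabilities(
    visual_feat: Vector, text_feats: List[Vector], tau: float = 0.07
) -> List[float]:
    """Softmax over cos(visual, text_k) / tau for all categories seen so far."""
    logits = [cosine(visual_feat, t) / tau for t in text_feats]
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]
```

A smaller temperature sharpens the distribution toward the best-matching category, while a larger one flattens it, which in turn affects which categories enter the candidate set in the second stage.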
[0037] From the initial predicted probability, the top K candidate categories are selected to form the candidate set. For the categories in the candidate set, external knowledge sources are called to generate or retrieve discriminative semantic descriptions between categories, and the calibration scores of the candidate categories are calculated based on these descriptions. For each pair of candidate categories, a text generation model, an attribute dictionary, or a difference-description knowledge base can be used to generate or retrieve a discriminative semantic description of one category relative to the other, forming the pairwise description set corresponding to the candidate set. The calibration score of each candidate category is defined in terms of the following quantities: the image to be classified; the number of categories in the candidate set; the discriminative semantic description that distinguishes the candidate category from another category in the candidate set; and the matching score between the image to be classified and that discriminative semantic description. Specifically, the matching score is obtained by using the image encoding branch and the text encoding branch to calculate the cosine similarity between the image embedding and the description embedding, which can be written as $m(x, d) = \cos(f_{\mathrm{img}}(x), f_{\mathrm{txt}}(d))$, where $x$ is the image to be classified, $d$ is a discriminative semantic description, and $f_{\mathrm{img}}$ and $f_{\mathrm{txt}}$ denote the image encoding branch and the text encoding branch. The initial predicted probability and the calibration score are then fused to obtain the final predicted probability, and the final classification result is output and fed back to the user.
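The second inference stage can be illustrated with a short sketch covering candidate selection, calibration, and fusion. The exact calibration and fusion formulas are not recoverable from this text, so both are assumptions: the calibration score of a candidate is taken as the mean matching score over its pairwise discriminative descriptions against the other candidates, and fusion is a convex combination (weight `alpha`, hypothetical) of the renormalized initial probability and a softmax of the calibration scores. The callback `match(k, j)` stands in for the matching score between the image and the description that distinguishes category `k` from category `j`.

```python
import math
from typing import Callable, Dict, List, Sequence


def top_k(probs: Sequence[float], k: int) -> List[int]:
    """Indices of the k categories with the highest initial probability."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return order[:k]


def calibration_scores(
    candidates: List[int], match: Callable[[int, int], float]
) -> Dict[int, float]:
    """Mean pairwise matching score per candidate (aggregation is an assumption)."""
    scores: Dict[int, float] = {}
    for k in candidates:
        others = [j for j in candidates if j != k]
        scores[k] = sum(match(k, j) for j in others) / len(others) if others else 0.0
    return scores


def fuse(
    probs: Sequence[float], scores: Dict[int, float], alpha: float = 0.5
) -> Dict[int, float]:
    """Blend renormalized initial probabilities with softmaxed calibration scores."""
    cands = list(scores)
    p_total = sum(probs[k] for k in cands)
    m = max(scores.values())
    exp_s = {k: math.exp(scores[k] - m) for k in cands}
    s_total = sum(exp_s.values())
    return {
        k: (1.0 - alpha) * probs[k] / p_total + alpha * exp_s[k] / s_total
        for k in cands
    }
```

With initial probabilities `[0.05, 0.45, 0.40, 0.10]` and `k = 2`, the candidates are categories 1 and 2. If the discriminative description favors category 2 strongly enough, the fused probability can overturn the first-stage ranking, which is the purpose of the calibration stage.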
[0038] To implement the above embodiments, this application also discloses an electronic device. The electronic device may include a processing unit (e.g., a central processing unit, a graphics processing unit, etc.) that can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) or a program loaded from a storage device into a random access memory (RAM). Various programs and data required for the operation of the electronic device are also stored in the RAM. The processing unit, ROM, and RAM are interconnected via a bus. An input/output (I/O) interface is also connected to the bus. Typically, the following devices can be connected to the I/O interface: input devices including, for example, a touchscreen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices including, for example, a liquid crystal display (LCD), speaker, vibrator, etc.; storage devices including, for example, magnetic tape, hard disk, etc.; and communication devices. The communication device allows the electronic device to communicate with other devices, wirelessly or by wire, to exchange data. Although an electronic device with various devices is shown, it is not required to implement or include all of the devices shown; more or fewer devices may be implemented or included instead.
[0039] In particular, according to some embodiments of this disclosure, the processes described above with reference to the flowcharts can be implemented as computer software programs. For example, some embodiments of this disclosure include a computer program product comprising a computer program carried on a computer storage medium, the computer program containing program code for performing the methods shown in the flowcharts. In such embodiments, the computer program can be downloaded and installed from a network via a communication device, or installed from a storage device, or installed from a ROM. When the computer program is executed by a processing device, it performs the functions defined above in the methods of some embodiments of this disclosure.
[0040] It should be noted that the computer storage medium in some embodiments of this disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination thereof. A computer-readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of a computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination thereof.
[0041] In some embodiments of this disclosure, a computer-readable storage medium can be any tangible medium that contains or stores a program that can be used by or in connection with an instruction execution system, apparatus, or device. In some embodiments of this disclosure, a computer-readable signal medium can include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such propagated data signals can take various forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination thereof. A computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device. The program code contained on the computer storage medium can be transmitted using any suitable medium, including but not limited to wires, optical fibers, RF (radio frequency), or any suitable combination thereof.
[0042] In other implementations, clients and servers may communicate using any currently known or future-developed network protocol, such as HTTP (Hypertext Transfer Protocol), and may be interconnected by digital data communication in any form or medium (e.g., a communication network). Examples of communication networks include local area networks ("LANs"), wide area networks ("WANs"), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future-developed networks.
[0043] The aforementioned computer storage medium may be included within the aforementioned electronic device, or it may exist independently and not assembled into the electronic device. The aforementioned computer storage medium carries one or more programs that, when executed by the electronic device, enable the electronic device to implement an edge device image classification method based on a vision-language model.
[0044] Computer program code for performing operations of some embodiments of this disclosure can be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code can be executed entirely on the user's computer, partially on the user's computer, as a standalone software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server. In cases involving remote computers, the remote computer can be connected to the user's computer via any type of network—including a local area network (LAN) or a wide area network (WAN)—or can be connected to an external computer (e.g., via the Internet using an Internet service provider).
[0045] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of code containing one or more executable instructions for implementing a specified logical function. It should also be noted that in some alternative implementations, the functions indicated in the blocks may occur in a different order than those indicated in the drawings. Each block in the block diagrams and / or flowcharts, and combinations of blocks in the block diagrams and / or flowcharts, may be implemented using a dedicated hardware-based system that performs the specified function or operation, or using a combination of dedicated hardware and computer instructions. Units described in some embodiments of the present disclosure may be implemented in software or hardware. The described units may also be located in a processor, and the names of these units do not necessarily constitute a limitation on the unit itself.
[0046] The functions described above in this document can be performed, at least in part, by one or more hardware logic components. For example, and without limitation, exemplary types of hardware logic components that can be used include Field Programmable Gate Arrays (FPGAs), Application-Specific Integrated Circuits (ASICs), Application-Specific Standard Products (ASSPs), Systems-on-Chip (SoCs), Complex Programmable Logic Devices (CPLDs), and so on.
[0047] All technologies not described in detail in this invention are existing technologies. The above descriptions are merely some preferred embodiments of this disclosure and explanations of the technical principles employed. Those skilled in the art should understand that the scope of the invention involved in the embodiments of this disclosure is not limited to technical solutions formed by specific combinations of the above-described technical features, but also covers other technical solutions formed by arbitrary combinations of the above-described technical features or their equivalents without departing from the above-described inventive concept. One example is a technical solution formed by replacing the above-described features with technical features of similar function disclosed in (but not limited to) the embodiments of this disclosure.
Claims
1. A method for classifying edge device images based on a vision-language model, characterized in that it comprises the following steps:
Step 1: Continuously acquire labeled streaming image data from edge devices, perform unified preprocessing, and divide the preprocessed data into an initial task set and subsequent incremental task sets;
Step 2: Based on the initial task set and the incremental task sets, semantically expand and filter the class name text to obtain an external semantic description library that can be used for training and inference;
Step 3: Load the pre-trained multimodal representation model and initialize it with publicly available pre-trained parameters;
Step 4: First train the lightweight adaptation parameters on the initial task set and establish a category classification head or category representation structure, and then fine-tune and update the model online based on the incremental task sets;
Step 5: Obtain the image to be classified, calculate its image embedding features, and obtain the final prediction probability through two-stage inference based on the accumulated task-level adaptation parameters;
Step 6: Based on the final predicted probability, output the final classification result and provide feedback to the user.
2. The edge device image classification method based on a vision-language model according to claim 1, characterized in that: in step 1, image data provided by the user and the corresponding category labels are first collected on the edge device to form initial samples, which are preprocessed to obtain the initial task set; then, image data of newly emerging categories and the corresponding category labels are continuously collected in open scenes to form incremental samples, which are preprocessed to obtain the incremental task sets. The preprocessing includes image resizing, pixel normalization, and one or more image enhancement operations such as random cropping, color perturbation, and geometric transformation.
3. The edge device image classification method based on a vision-language model according to claim 2, characterized in that step 2 further comprises the following steps:
Step 2.1: Load the set of class name texts or category phrases corresponding to the category labels in the initial task set and the incremental task sets;
Step 2.2: Semantically expand the class name text based on external knowledge sources to obtain a semantic description set for each category;
Step 2.3: Filter and standardize the semantic description sets to create a semantic description library that can be used for matching or training, retaining at least one description for each category.
4. The edge device image classification method based on a vision-language model according to claim 3, characterized in that step 4 further comprises the following steps:
Step 4.1: Train the lightweight adaptation parameters on the initial task set to obtain the initial-stage parameter set, and establish the category classification head or category prototype representation available in the initial stage;
Step 4.2: For each incremental task in the incremental task sets, initialize the corresponding task-level semantic mapping unit and view consistency mapping unit to form the task-level adaptation parameters, and freeze the adaptation parameters of all historical tasks;
Step 4.3: Fine-tune the multimodal representation model on the current incremental task set according to the training objectives;
Step 4.4: Construct compressed replay information for historical categories and combine it with the current task data in the training objective of step 4.3 to suppress forgetting;
Step 4.5: Output and save the accumulated task-level adaptation parameters up to the current task, and update the category classification head or category representation structure used for inference.
5. The edge device image classification method based on a vision-language model according to claim 4, characterized in that step 5 further comprises the following steps:
Step 5.1: Preprocess the image to be classified, input it into the image encoding branch, and calculate the basic image embedding features;
Step 5.2: Based on the accumulated task-level adaptation parameters, match the image embedding features with the text embedding features of each category to obtain the initial prediction probability;
Step 5.3: From the initial predicted probability, select the top $K$ categories to form the candidate set;
Step 5.4: For the categories in the candidate set, call external knowledge sources to generate or retrieve semantic descriptions that distinguish between the categories;
Step 5.5: Calculate the calibration score of each candidate category based on the discriminative semantic descriptions;
Step 5.6: Fuse the initial predicted probability with the calibration score to obtain the final predicted probability.
6. The edge device image classification method based on a vision-language model according to claim 4, characterized in that: in step 4.3, the training objective includes at least one or a combination of the following: using the semantic mapping unit to map the text features of the templated class name text so that, in the embedding space, they remain similar to the description features of the corresponding category in the semantic description library; using the view consistency mapping unit to apply consistency constraints to the features of different augmented views of the same image; and calculating a cross-modal matching loss from the mapped image features and the mapped text features to complete classification learning for the current task.
7. The edge device image classification method based on a vision-language model according to claim 5, characterized in that: in step 5.2, the initial predicted probability that the image to be classified, $x$, belongs to the $k$-th category is defined as:

$$p_k(x) = \frac{\exp\left(\cos(v, t_k)/\tau\right)}{\sum_{j=1}^{C} \exp\left(\cos(v, t_j)/\tau\right)}$$

where $v$ denotes the visual embedding features of the image to be classified after processing by the image encoding branch and the cumulative task-level view consistency mapping unit; $t_k$ denotes the text embedding features of the $k$-th category's text after processing by the text encoding branch and the cumulative task-level semantic mapping unit; $\cos(\cdot,\cdot)$ denotes the cosine similarity function; $\tau$ denotes the temperature coefficient; and $C$ denotes the total number of categories seen up to the current stage.
8. The edge device image classification method based on a vision-language model according to claim 5, characterized in that: in step 5.5, for a candidate category $c_i$, the calibration score is calculated from the matching scores $s(x, d_{i,j})$, where $x$ denotes the image to be classified, $K$ denotes the number of categories in the candidate set, $d_{i,j}$ denotes the discriminative semantic description of category $c_i$ relative to category $c_j$, and $s(x, d_{i,j})$ denotes the matching score between the image to be classified and that discriminative semantic description.
9. An electronic device, characterized in that it comprises: at least one processor, and a memory communicatively connected to the at least one processor; the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method according to any one of claims 1 to 8.
10. A computer storage medium storing a computer program, characterized in that: when the computer program is executed by a processor, the steps of the method according to any one of claims 1 to 8 are performed.